Outcome-Oriented Predictive Process Monitoring: Review and Benchmark

07/21/2017 ∙ by Irene Teinemaa, et al. ∙ University of Tartu ∙ The University of Melbourne

Predictive business process monitoring refers to the act of making predictions about the future state of ongoing cases of a business process, based on their incomplete execution traces and logs of historical (completed) traces. Motivated by the increasingly pervasive availability of fine-grained event data about business process executions, the problem of predictive process monitoring has received substantial attention in the past years. In particular, a considerable number of methods have been put forward to address the problem of outcome-oriented predictive process monitoring, which refers to classifying each ongoing case of a process according to a given set of possible outcomes - e.g. Will the customer complain or not? Will an order be delivered, cancelled or withdrawn? Unfortunately, different authors have used different datasets, experimental settings, evaluation measures and baselines to assess their proposals, resulting in poor comparability and an unclear picture of the relative merits and applicability of different methods. To address this gap, this article presents a systematic review and taxonomy of outcome-oriented predictive process monitoring methods, and a comparative experimental evaluation of eleven representative methods using a benchmark covering twelve predictive process monitoring tasks based on four real-life event logs.

Code Repositories

predictive-monitoring-benchmark

Benchmark evaluation for outcome-based predictive monitoring



1. Introduction

Business process monitoring is the act of analyzing events produced by the executions of a business process at runtime, in order to understand its performance and its conformance with respect to a set of business goals (Dumas et al., 2013). Traditional process monitoring techniques provide dashboards and reports showing the recent performance of a business process in terms of key performance indicators such as mean execution time, resource utilization or error rate with respect to a given notion of error.

Predictive (business) process monitoring techniques go beyond traditional ones by making predictions about the future state of the executions of a business process (herein called cases). For example, a predictive monitoring technique may seek to predict the remaining execution time of each ongoing case of a process (Rogge-Solti and Weske, 2013), the next activity that will be executed in each case (Evermann et al., 2016), or the final outcome of a case, with respect to a possible set of business outcomes (Metzger et al., 2012; Maggi et al., 2014; Metzger et al., 2015). For instance, in an order-to-cash process (a process going from the receipt of a purchase order to the receipt of payment of the corresponding invoice), the possible outcomes of a case may be that the purchase order is closed satisfactorily (i.e., the customer accepted the products and paid) or unsatisfactorily (e.g., the order was canceled or withdrawn). Another set of possible outcomes is that the products were delivered on time (with respect to a maximum acceptable delivery time), or delivered late.

Recent years have seen the emergence of a rich field of proposed methods for predictive process monitoring in general, and predictive monitoring of (categorical) case outcomes in particular – herein called outcome-oriented predictive process monitoring. Unfortunately, there is no unified approach to evaluate these methods. Indeed, different authors have used different datasets, experimental settings, evaluation measures and baselines.

This paper aims at filling this gap by (i) performing a systematic literature review of outcome-oriented predictive process monitoring methods; (ii) providing a taxonomy of existing methods; and (iii) performing a comparative experimental evaluation of eleven representative methods, using a benchmark of 24 predictive monitoring tasks based on nine real-life event logs.

The contribution of this study is a categorized collection of outcome-oriented predictive process monitoring methods and a benchmark designed to enable researchers to empirically compare new methods against existing ones in a unified setting. The benchmark is provided as an open-source framework that allows researchers to run the entire benchmark with minimal effort, and to configure and extend it with additional methods and datasets.

The rest of the paper is structured as follows. Section 2 introduces some basic concepts and definitions. Section 3 describes the search and selection of relevant studies. Section 4 surveys the selected studies and provides a taxonomy to classify them. Section 5 reports on the benchmark evaluation of the selected studies, while Section 6 discusses threats to validity. Finally, Section 7 summarizes the findings and outlines directions for future work.

2. Background

The starting point of predictive process monitoring is a set of event records representing the execution of activities in a business process. An event record has a number of attributes. Three of these attributes are present in every event record, namely the event class (a.k.a. activity name) specifying which activity the event refers to, the timestamp specifying when the event occurred, and the case id indicating which case of the process generated this event. In other words, every event represents the occurrence of an activity at a particular point in time and in the context of a given case. An event record may carry additional attributes in its payload. These are called event-specific attributes (or event attributes for short). For example, in an order-to-cash process, the amount of the invoice may be recorded as an attribute of an event referring to activity “Create invoice”. Other attributes, namely case attributes, belong to the case and are hence shared by all events generated by the same case. For example, in an order-to-cash process, the customer identifier is likely to be a case attribute. If so, this attribute will appear in every event of every case of the order-to-cash process, and it has the same value for all events generated by a given case. In other words, the value of a case attribute is static, i.e., it does not change throughout the lifetime of a case, as opposed to attributes in the event payload, which are dynamic as they change from one event to the next.

Formally, an event record is defined as follows:

Definition 2.1 (Event).

An event is a tuple $(a, c, t, (d_1, v_1), \ldots, (d_m, v_m))$ where $a$ is the activity name, $c$ is the case id, $t$ is the timestamp and $(d_1, v_1), \ldots, (d_m, v_m)$ (where $m \geq 0$) are the event or case attributes and their values.

Herein, we use the term event as a shorthand for event record. The universe of all events is hereby denoted by $\mathcal{E}$.

The sequence of events generated by a given case forms a trace. Formally:

Definition 2.2 (Trace).

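A trace is a non-empty sequence $\sigma = [e_1, \ldots, e_n]$ of events such that $\forall i \in [1..n],\ e_i \in \mathcal{E}$ and $\forall i, j \in [1..n],\ e_i.c = e_j.c$. In other words, all events in the trace refer to the same case.

The universe of all possible traces is denoted by $\mathcal{S}$.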

A set of completed traces (i.e., traces recording the execution of completed cases) is called an event log.

As a running example, we consider a simple log of a patient treatment process containing two cases (cf. Figure 1). The activity name of the first event in the first trace, $\sigma_1$, is consultation; it refers to case 1 and occurred at 10:30AM. The additional event attributes show that the cost of the procedure was 10 and that the activity was performed in the radiotherapy department. These two are event attributes. Note that not all events carry every possible event attribute. For example, the first event of the second trace, $\sigma_2$, does not have the attribute amountPaid. In other words, the set of event attributes can differ from one event to another even within the same trace. The events in each trace also carry two case attributes: the age of the patient and the gender. The latter attributes have the same value for all events of a trace.

$\sigma_1$ = [(consultation, 1, 10:30AM, (age, 33), (gender, female), (amountPaid, 10), (department, radiotherapy)), …,
(ultrasound, 1, 10:55AM, (age, 33), (gender, female), (amountPaid, 15), (department, NursingWard))]
$\sigma_2$ = [(order blood, 2, 12:30PM, (age, 56), (gender, male), (department, GeneralLab)), …,
(payment, 2, 2:30PM, (age, 56), (gender, male), (amountPaid, 100), (department, FinancialDept))]
Figure 1. Extract of an event log.
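To make the event and trace definitions concrete, the sketch below represents the four events shown in Figure 1 as Python objects (the elided events are left out). The names `Event`, `is_trace` and `static_attributes` are illustrative choices, not part of the paper, and the helper that detects case attributes by checking for a constant value across events is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Event:
    activity: str                  # event class (activity name)
    case_id: int                   # case id
    timestamp: str                 # kept as a plain string for brevity
    attributes: Dict[str, Any] = field(default_factory=dict)  # payload


def is_trace(events: List[Event]) -> bool:
    """A trace is a non-empty sequence of events that share one case id."""
    return len(events) > 0 and all(e.case_id == events[0].case_id for e in events)


def static_attributes(trace: List[Event]) -> Dict[str, Any]:
    """Attributes present in every event of the trace with the same value."""
    first = trace[0].attributes
    return {
        name: value
        for name, value in first.items()
        if all(name in e.attributes and e.attributes[name] == value for e in trace)
    }


# Only the events visible in Figure 1; the events hidden behind "…" are omitted.
sigma_1 = [
    Event("consultation", 1, "10:30AM",
          {"age": 33, "gender": "female", "amountPaid": 10, "department": "radiotherapy"}),
    Event("ultrasound", 1, "10:55AM",
          {"age": 33, "gender": "female", "amountPaid": 15, "department": "NursingWard"}),
]
sigma_2 = [
    Event("order blood", 2, "12:30PM",
          {"age": 56, "gender": "male", "department": "GeneralLab"}),
    Event("payment", 2, "2:30PM",
          {"age": 56, "gender": "male", "amountPaid": 100, "department": "FinancialDept"}),
]
```

In this representation the case attributes (age, gender) are repeated in every event, mirroring the figure, while the event attributes vary or may be absent.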

An event or a case attribute can be of numeric, categorical, or of textual data type. Each data type requires different preprocessing to be usable by the classifier. With respect to the running example, possible event and case attributes and their type are presented in Table 1.

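| Attribute scope | Type | Example |
|---|---|---|
| Case (static) | categorical | patient’s gender |
| Case (static) | numeric | patient’s age |
| Case (static) | textual | description of the application |
| Event (dynamic) | categorical | activity, resource |
| Event (dynamic) | numeric | amount paid |
| Event (dynamic) | textual | patient’s medical history |

Table 1. Data attributes in the event log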

In predictive process monitoring, we aim at making predictions for traces of incomplete cases, rather than for traces of completed cases. Therefore, we make use of a function that returns the first $k$ events of a trace of a (completed) case.

Definition 2.3 (Prefix function).

Given a trace $\sigma = [e_1, \ldots, e_n]$ and a positive integer $k \leq n$, $hd^k(\sigma) = [e_1, \ldots, e_k]$.
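As a minimal sketch, the prefix function amounts to slicing a sequence. The function name `prefix` and the choice to raise an error for an out-of-range length are assumptions made for illustration.

```python
from typing import List, Sequence, TypeVar

E = TypeVar("E")


def prefix(trace: Sequence[E], k: int) -> List[E]:
    """Return the first k events of a trace, for 1 <= k <= len(trace)."""
    if not 1 <= k <= len(trace):
        raise ValueError("k must be a positive integer no larger than the trace length")
    return list(trace[:k])
```

For instance, calling `prefix` on a three-event trace with `k = 2` returns its first two events, and `k = 3` returns the complete trace.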

Given a trace, outcome-oriented predictive process monitoring aims at predicting its class label (expressing its outcome according to some business goal), given a set of completed cases with their known class labels.

Definition 2.4 (Labeling function).

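A labeling function $y : \mathcal{S} \rightarrow \mathcal{Y}$ is a function that maps a trace $\sigma$ to its class label $y(\sigma) \in \mathcal{Y}$ with $\mathcal{Y}$ being the domain of the class labels. For outcome predictions, $\mathcal{Y}$ is a finite set of categorical outcomes. For example, for a binary outcome $\mathcal{Y} = \{0, 1\}$.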

Predictions are made using a classifier that takes as input a fixed number of independent variables (herein called features) and learns a function to estimate the dependent variable (class label). This means that in order to use the data in an event log as input of a classifier, each trace in the log must be encoded as a feature vector.

Definition 2.5 (Sequence/trace encoder).

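A sequence (or trace) encoder $f : \mathcal{S} \rightarrow \mathcal{X}_1 \times \cdots \times \mathcal{X}_p$ is a function that takes a (partial) trace $\sigma$ and transforms it into a feature vector in the $p$-dimensional vector space $\mathcal{X}_1 \times \cdots \times \mathcal{X}_p$ with $\mathcal{X}_j \subseteq \mathbb{R}$, $1 \leq j \leq p$, being the domain of the $j$-th feature.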

The features extracted from a trace may encode information on the activities performed during the execution of the trace and their order (herein called control-flow features), as well as information on event/case attributes (herein referred to as data payload features).
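The sketch below illustrates what an encoder could look like; it is not one of the specific encodings used by the surveyed methods. It assumes events are plain dictionaries with an `"activity"` key, uses activity counts as control-flow features, and uses the last observed value of selected numeric attributes as data payload features. The default value of 0.0 for an attribute that has not been observed yet is also an assumption.

```python
from typing import Any, Dict, List, Sequence


def encode(trace: Sequence[Dict[str, Any]],
           activities: Sequence[str],
           numeric_attrs: Sequence[str]) -> List[float]:
    """Map a (partial) trace to a fixed-length feature vector."""
    # Control-flow features: how often each known activity occurs in the trace.
    counts = [float(sum(1 for e in trace if e["activity"] == a)) for a in activities]

    # Data payload features: last observed value of each numeric attribute.
    last_values = []
    for name in numeric_attrs:
        value = 0.0  # assumed default when the attribute has not been seen
        for e in trace:
            if name in e:
                value = float(e[name])
        last_values.append(value)

    return counts + last_values


trace = [
    {"activity": "consultation", "amountPaid": 10},
    {"activity": "ultrasound", "amountPaid": 15},
]
activities = ["consultation", "ultrasound", "payment"]
vector = encode(trace, activities, ["amountPaid"])
```

Because the vocabulary of activities and the list of attributes are fixed in advance, prefixes of different lengths map to vectors of the same dimension, which is what a standard classifier expects.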

A classifier is a function that assigns a class label to a feature vector.

Definition 2.6 (Classifier).

A classifier $cls : \mathcal{X}_1 \times \cdots \times \mathcal{X}_p \rightarrow \mathcal{Y}$ is a function that takes an encoded $p$-dimensional sequence and estimates its class label.

The construction of a classifier (a.k.a. classifier training) for outcome-oriented predictive process monitoring is achieved by applying a classification algorithm over a set of prefixes of an event log. Accordingly, given a log $L$, we define its prefix log $L^*$ to be the event log that contains all prefixes of $L$, i.e., $L^* = \{hd^k(\sigma) : \sigma \in L, 1 \leq k \leq |\sigma|\}$. Since the main aim of predictive process monitoring is to make predictions as early as possible (rather than when a case is about to complete), we often focus on the subset of the prefix log containing traces of up to a given length. Accordingly, we define the length-filtered prefix log $L^*_k$ to be the subset of $L^*$ containing only prefixes of size less than or equal to $k$.
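A prefix log and its length-filtered variant can be sketched as follows. The sketch assumes that the log is a list of (trace, label) pairs, that each prefix inherits the label of its completed trace, and that the complete trace counts as its own longest prefix; the function name `prefix_log` is illustrative.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

LabeledTrace = Tuple[Sequence[str], Hashable]


def prefix_log(log: Sequence[LabeledTrace],
               max_len: Optional[int] = None) -> List[Tuple[List[str], Hashable]]:
    """Return all prefixes of all traces, optionally capped at max_len events."""
    result = []
    for trace, label in log:
        upper = len(trace) if max_len is None else min(len(trace), max_len)
        for k in range(1, upper + 1):
            # Each prefix carries the outcome of the completed case.
            result.append((list(trace[:k]), label))
    return result


log = [(["a", "b", "c"], 1), (["a", "d"], 0)]
all_prefixes = prefix_log(log)           # prefix log
short_prefixes = prefix_log(log, 2)      # length-filtered prefix log, at most 2 events
```

Here the three-event trace contributes three prefixes to the full prefix log but only two to the length-filtered one.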

With respect to the broader literature on machine learning, we note that predictive process monitoring corresponds to a problem of early sequence classification. In other words, given a set of labeled sequences, the goal is to build a model that for a sequence prefix predicts the label this prefix will get when completed. A survey on sequence classification presented in (Xing and Pei, 2010) provides an overview of techniques in this field. This latter survey noted that, while there is substantial literature on the problem of sequence classification for simple symbolic sequences (e.g., sequences of events without payloads), there is a lack of proposals addressing the problem for complex symbolic sequences (i.e., sequences of events with payloads). The problem of outcome-oriented predictive process monitoring can be seen as an early classification over complex sequences where each element has a timestamp, a discrete attribute referring to an activity, and a payload made of a heterogeneous set of other attributes.

3. Search methodology

In order to retrieve and select studies for our survey and benchmark, we conducted a Systematic Literature Review (SLR) according to the approach described in (Kitchenham, 2004). We started by specifying the research questions. Next, guided by these goals, we developed relevant search strings for querying a database of academic papers. We applied inclusion and exclusion criteria to the retrieved studies in order to filter out irrelevant ones, and last, we divided all relevant studies into primary and subsumed ones based on their contribution.

3.1. Research questions

The purpose of this survey is to define a taxonomy of methods for outcome-oriented predictive monitoring of business processes. The decision to focus on outcome-oriented predictive monitoring is to have a well-delimited and manageable scope, given the richness of the literature in the broader field of predictive process monitoring, and the fact that other predictive process monitoring tasks rely on entirely different techniques and evaluation measures.

In line with the selected scope, the survey focuses specifically on the following research question:

  RQ0. Given an event log of completed business process execution cases and the final outcome (class) of each case, how to train a model that can accurately and efficiently predict the outcome of an incomplete (partial) trace, based on the given prefix only?

We then decomposed this overarching question into the following subquestions:

  RQ1. What methods exist for predictive outcome-oriented monitoring of business processes?

  RQ2. How to categorize these methods in a taxonomy?

  RQ3. What is the relative performance of these methods?

In the following subsections, we describe our approach to identifying existing methods for predictive outcome-oriented process monitoring (RQ1). Subsequent sections address the other two research questions.

3.2. Study retrieval

First, we came up with relevant keywords according to the research question of predictive outcome-oriented process monitoring (RQ1) and our knowledge of the subject. We considered the following keywords relevant:

  • “(business) process” — a relevant study must take as input an event log of business process execution data;

  • “monitoring” — a relevant study should concern run-time monitoring of business processes, i.e., work with partial (running) traces;

  • “prediction” — a relevant study needs to estimate what will happen in the future, rather than monitor what has already happened.

We deliberately left out “outcome” from the set of keywords. The reason for this is that we presumed that different authors might use different words to refer to this prediction target. Therefore, in order to obtain a more exhaustive set of relevant papers, we decided to filter out studies that focus on other prediction targets (rather than the final outcome) in an a-posteriori filtering phase.

Based on these selected keywords, we constructed three search phrases: “predictive process monitoring”, “predictive business process monitoring”, and “business process prediction”. We applied these search strings to the Google Scholar academic database and retrieved all studies that contained at least one of the phrases in the title, keywords, abstract, or the full text of the paper. We used Google Scholar, a well-known electronic literature database, as it encompasses all relevant databases such as ACM Digital Library and IEEE Xplore, and also allows searching within the full text of a paper.

The search was conducted in August 2017 and returned 93 papers, excluding duplicates.

3.3. Study selection

All the retrieved studies were matched against several inclusion and exclusion criteria to further determine their relevance to predictive outcome-oriented process monitoring. In order to be considered relevant, a study must satisfy all of the inclusion criteria and none of the exclusion criteria.

The assessment of each study was performed independently by two authors of this paper, and the results were compared to resolve inconsistencies with the mediation of a third author.

3.3.1. Inclusion criteria

The inclusion criteria are designed for assessing the relevance of studies on a superficial basis. Namely, these criteria are checked without working through the full text of the paper. The following inclusion criteria were applied to the retrieved studies:

  IN1. The study is concerned with predictions in the context of business processes (this criterion was assessed by reading title and abstract).

  IN2. The study is cited at least five times.

The application of these inclusion criteria to the original set of retrieved papers resulted in eight relevant studies. We proceeded with one-hop-snowballing, i.e., we retrieved the papers that are related to (cite or are cited by) these eight studies and applied the same inclusion criteria. This procedure resulted in 545 papers, of which we retained 70 unique papers after applying the inclusion criteria. (Footnote 1: All retrieved papers that satisfy the inclusion criteria can be found at http://bit.ly/2uspLRp)

3.3.2. Exclusion criteria

The list of studies that passed the inclusion criteria was further assessed according to a number of exclusion criteria. Determining if the exclusion criteria are satisfied could require a deeper analysis of the study, e.g., examining the approach and/or results sections of the paper. The applied exclusion criteria are:

  EX1. The study does not actually propose a predictive process monitoring method.

  EX2. The study does not concern outcome-oriented prediction.

  EX3. The technique proposed in the study is tailored to a specific labeling function, i.e., the study assumes a labeling function for the case outcome that is not black-box.

  EX4. The study does not take an event log as input.

The EX1 criterion excludes overview papers, as well as studies that, after a more thorough examination, turned out to be focusing on some research question other than predictive process monitoring. EX2 excludes studies where the prediction target is something other than the final outcome. Common examples of other prediction targets that are considered irrelevant to this study are remaining time and next activity prediction. Using EX3, we excluded studies that are not directly about classification, i.e., that do not follow a black-box prediction of the case class. For example, we excluded studies that predict deadline violations by setting a threshold on the predicted remaining time, rather than by directly classifying the case as likely to violate the deadline or not. The reason for excluding such studies is that, in essence, they predict a numeric value, and are thus not applicable for predicting an arbitrarily defined case outcome. EX4 concerns studies that propose methods that do not utilize at least the following essential parts of an event log: the case identifier, the timestamp and the event classes. For instance, we excluded methods that take as input numerical time series without considering the heterogeneity in the control flow (event classes). In particular, this is the case in manufacturing processes which are of linear nature (a process chain). The reason for excluding such studies is that the challenges when predicting for a set of cases of heterogeneous length are different from those when predicting for linear processes. While methods designed for heterogeneous processes are usually applicable to those of linear nature, the converse does not hold. Moreover, the linear nature of a process makes it possible to apply other, more standard methods that may achieve better performance.

The application of the exclusion criteria resulted in 14 relevant studies out of the 70 studies selected in the previous step.

3.4. Primary and subsumed studies

Among the papers that successfully passed both the inclusion and exclusion criteria, we determined primary studies that constitute an original contribution for the purposes of our benchmark, and subsumed studies that are similar to one of the primary studies and do not provide a substantial contribution with respect to it.

Specifically, a study is considered subsumed if:

  • there exists a more recent and/or more extensive version of the study from the same authors (e.g., a conference paper is subsumed by an extended journal version), or

  • it does not propose a substantial improvement/modification over a method that is documented in an earlier paper by other authors, or

  • the main contribution of the paper is a case study or a tool implementation, rather than the predictive process monitoring method itself, and the method is described and/or evaluated more extensively in a more recent study by other authors.

This procedure resulted in seven primary and seven subsumed studies, listed in Table 2. In the next section we present the primary studies in detail, and classify them using a taxonomy.

| Primary study | Subsumed studies |
|---|---|
| de Leoni et al. (de Leoni et al., 2016) | de Leoni et al. (De Leoni et al., 2014) |
| Maggi et al. (Maggi et al., 2014) | |
| Lakshmanan et al. (Lakshmanan et al., 2010) | Conforti et al. (Conforti et al., 2013, 2015) |
| di Francescomarino et al. (Di Francescomarino et al., 2017) | |
| Leontjeva et al. (Leontjeva et al., 2015) | van der Spoel et al. (Van Der Spoel et al., 2012) |
| Verenich et al. (Verenich et al., 2015) | |
| Castellanos et al. (Castellanos et al., 2005) | Schwegmann et al. (Schwegmann et al., 2013a, b), Ghattas et al. (Ghattas et al., 2014) |

Table 2. Primary and subsumed studies

4. Analysis and taxonomy

In this section we present a taxonomy to classify the seven primary studies that we selected through our SLR. Effectively, with this section we aim at answering RQ1 (What methods exist?) and RQ2 (How to categorize them?) – cf. Section 3.1. The taxonomy is framed upon a general workflow for predictive process monitoring, which we derived by studying all the methods surveyed. This workflow is divided into two phases: offline, to train a prediction model based on historical cases, and online, to make predictions on running process cases. The offline phase, shown in Fig. 2, consists of four steps. First, given an event log, case prefixes are extracted and filtered (e.g., to retain only prefixes up to a certain length). Next, the identified prefixes are divided into buckets (e.g., based on process states or similarities among prefixes) and features are encoded from these buckets for classification. Finally, each bucket of encoded prefixes is used to train a classifier.

Figure 2. Predictive process monitoring workflow (offline phase)

The online phase, shown in Fig. 3, concerns the actual prediction for a running trace, by reusing the elements (buckets, classifiers) built in the offline phase. Specifically, given a running trace and a set of buckets of historical prefixes, the correct bucket is first determined. Next, this information is used to encode the features of the running trace for classification. In the last step, a prediction is extracted from the encoded trace using the correct classifier for the determined bucket.

Figure 3. Predictive process monitoring workflow (online phase)

We note that there is an exception among the surveyed methods that does not perfectly fit the presented workflow. Namely, the KNN approach proposed by Maggi et al. (Maggi et al., 2014) omits the offline phase. Instead, in this approach the bucket (a set of similar traces from the training set) is determined and a classifier is trained during the online phase, separately for each running case.

Table 3 lists the seven primary studies identified in our SLR, and shows their characteristics according to the four steps of the offline phase (prefix selection and filtering, trace bucketing, sequence encoding and classification algorithm). In the rest of this section we survey these studies based on these characteristics, and use this information to build a taxonomy that allows us to classify the studies.

| Primary study | Prefix extraction and filtering | Trace bucketing | Sequence encoding: control flow | Sequence encoding: data | Classification algorithm |
|---|---|---|---|---|---|
| de Leoni et al. (de Leoni et al., 2016) | all | Single | agg, last state | agg, last state | DT |
| Maggi et al. (Maggi et al., 2014) | all | KNN | agg | last state | DT |
| Lakshmanan et al. (Lakshmanan et al., 2010) | all | State | last state | last state | DT |
| di Francescomarino et al. (Di Francescomarino et al., 2017) | prefix length 1-21, with gap 3, 5, or 10 | Cluster | agg | last state | DT, RF |
| Leontjeva et al. (Leontjeva et al., 2015) | prefix length 2-20 | Prefix length | index | index | DT, RF, GBM, SVM |
| | | | index | last state | RF |
| | | | agg | - | RF |
| Verenich et al. (Verenich et al., 2015) | prefix length 2-20 | Prefix length + cluster | index | index | RF |
| Castellanos et al. (Castellanos et al., 2005) | all | Domain knowledge | unknown | unknown | DT |

Table 3. Classification of the seven primary studies according to the four steps of the offline phase.

4.1. Prefix extraction and filtering

After analyzing the identified studies, we found that all of them take as input a prefix log (as defined in Section 2) to train a classifier. This choice is natural given that at runtime, we need to make predictions for partial traces rather than completed ones. Using a prefix log for training ensures that our training data is comparable to the testing data. For example, for a complete trace consisting of a total of 5 events, we could consider up to 4 prefixes: the partial trace after executing the first event, the partial trace after executing the first and the second event, and so on.

Using all possible prefixes raises multiple problems. Firstly, the large number of prefixes as compared to the number of traces considerably slows down the training of the prediction models. Secondly, if the length of the original cases is very heterogeneous, the longer traces produce many more prefixes than shorter ones and, therefore, the prediction model is biased towards the longer cases. Accordingly, it is common to consider prefixes up to a certain number of events only. For example, Di Francescomarino et al. (Di Francescomarino et al., 2017) limit the maximum prefix length to 21, while Leontjeva et al. (Leontjeva et al., 2015) use prefixes of up to 20 events only. In other words, in their training phase, these approaches take as input the length-filtered prefix log $L^*_k$ for $k = 21$ and $k = 20$, respectively.

Di Francescomarino et al. (Di Francescomarino et al., 2017) propose a second approach to filter the prefix log using so-called gaps. Namely, instead of retaining all prefixes of up to a certain length, they retain prefixes whose length is equal to a base number (e.g., 1) plus a multiple of a gap (e.g., 1, 6, 11, 16, 21 for a gap of 5). This approach helps to keep the prefix log sufficiently small for applications where efficiency of the calculations is a major concern.
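Gap-based filtering can be sketched as a selection of prefix lengths. The function name `gap_prefix_lengths` and its default parameter values are illustrative assumptions; the base of 1, gap of 5 and maximum length of 21 follow the example in the text.

```python
from typing import List


def gap_prefix_lengths(trace_len: int, base: int = 1, gap: int = 5,
                       max_len: int = 21) -> List[int]:
    """Prefix lengths of the form base + i * gap, capped by trace and maximum length."""
    upper = min(trace_len, max_len)
    return list(range(base, upper + 1, gap))


lengths_long_trace = gap_prefix_lengths(30)    # trace with 30 events
lengths_short_trace = gap_prefix_lengths(12)   # trace with 12 events
```

For a trace of 30 events this retains five prefixes instead of 21, which is where the reduction in training data comes from.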

We observe that length-based or gap-based filtering can be applied to any predictive process monitoring method. In other words, the choice of length or gap filtering is not an inherent property of a method.

4.2. Trace bucketing

Most existing predictive process monitoring approaches train multiple classifiers rather than a single one. In particular, the prefix traces in the historical log are divided into several buckets and a separate classifier is trained for each bucket. At run-time, the most suitable bucket for the ongoing case is determined and the respective classifier is applied to make a prediction. In the following, we describe the bucketing approaches that have been proposed by existing predictive process monitoring methods.

4.2.1. Single bucket

All prefix traces are considered to be in the same bucket. A single classifier is trained on the whole prefix log and applied directly to the running cases. The single bucket approach has been used in the work by de Leoni et al. (de Leoni et al., 2016).

4.2.2. KNN

In this bucketing approach, the offline training phase is skipped and the buckets are determined at run-time. Namely, for each running prefix trace, its $k$ nearest neighbors are selected from the historical prefix traces and a classifier is trained (at run-time) based on these neighbors. This means that the number of buckets (and classifiers) is not fixed, but grows with each executed event at run-time.

The KNN method for predictive process monitoring was proposed by Maggi et al. (Maggi et al., 2014). They calculate the similarities between prefix traces using string-edit distance on the control flow. All instances that exceed a specified similarity threshold are considered as neighbors of the running trace. If the number of neighbors found is less than 30, the 30 most similar instances are selected regardless of the similarity threshold.
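The neighbor selection step can be sketched as follows. This is not the authors' implementation: the normalization of the edit distance into a similarity in [0, 1] and the threshold value are assumptions, while the fallback to the 30 most similar instances follows the description above.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two activity sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def select_neighbors(running: Sequence[str],
                     historical: List[Sequence[str]],
                     threshold: float = 0.8,
                     min_neighbors: int = 30) -> List[Sequence[str]]:
    """Instances above the threshold, or the top min_neighbors if too few qualify."""
    ranked = sorted(historical, key=lambda h: similarity(running, h), reverse=True)
    above = [h for h in ranked if similarity(running, h) > threshold]
    return above if len(above) >= min_neighbors else ranked[:min_neighbors]
```

A classifier would then be trained on the returned neighbors only, once per running case, which is why this approach has no offline phase.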

4.2.3. State

In state-based approaches, a process model is derived from the event log. Then, relevant states (or decision points) are determined from the process model and one classifier is trained for each such state. At run-time, the current state of the running case is determined, and the respective classifier is used to make a prediction for the running case.

Given an event log, Lakshmanan et al. (Lakshmanan et al., 2010) construct a so-called activity graph where there is one node per possible activity (event class) in the log, and there is a directed edge from node $a_i$ to $a_j$ iff $a_j$ has occurred immediately after $a_i$ in at least one trace. This type of graph is also known as the Directly-Follows Graph (DFG) of an event log (van der Aalst, 2016). We observe that the DFG is the state-transition system obtained by mapping each trace prefix in the log to a state corresponding to the last activity appearing in the trace prefix (and hence the state of a running case is fully determined by its last activity). Alternative methods for constructing state abstractions are identified in (Van Der Aalst et al., 2010) (e.g., set-based, multiset-based and sequence-based state abstractions), but these have not been used for predictive process monitoring, and they are unlikely to be suitable since they generate a very large number of states, which would lead to a very large number of buckets. Most of these buckets would be too small to train a separate classifier.

In Lakshmanan et al. (Lakshmanan et al., 2010), the edges in the DFG are annotated with transition probabilities. The transition probability from node $a_i$ to $a_j$ captures how often, after performing activity $a_i$, $a_j$ is performed next. We observe that this DFG annotated with transition probabilities is a first-order Markov chain. For our purposes, however, the transition probabilities are not necessary, as we aim to make a prediction for any running case regardless of its frequency. Therefore, in the rest of this paper, we will use the DFG without transition probabilities.

Lakshmanan et al. (Lakshmanan et al., 2010) build one classifier per decision point — i.e., per state in the model where the execution splits into multiple alternative branches. Given that in our problem setting, we need to be able to make a prediction for a running trace after each event, a natural extension to their approach is to build one classifier for every state in the process model.
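The following sketch shows how a Directly-Follows Graph without transition probabilities could be derived, and how prefixes could be bucketed by state when the state is taken to be the last activity of the prefix. It assumes traces are given as lists of activity names; the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def directly_follows_graph(log: Sequence[Sequence[str]]) -> Set[Tuple[str, str]]:
    """Edges (a, b) such that b occurs immediately after a in at least one trace."""
    edges = set()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            edges.add((a, b))
    return edges


def state_buckets(log: Sequence[Sequence[str]]) -> Dict[str, List[List[str]]]:
    """Group every prefix by its last activity, i.e., by its state in the DFG."""
    buckets = defaultdict(list)
    for trace in log:
        for k in range(1, len(trace) + 1):
            prefix = list(trace[:k])
            buckets[prefix[-1]].append(prefix)
    return dict(buckets)


log = [["a", "b", "c"], ["a", "c"]]
dfg = directly_follows_graph(log)
buckets = state_buckets(log)
```

With one classifier per key of `buckets`, a running case is routed at run-time to the classifier of its most recent activity.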

4.2.4. Clustering

The cluster-based bucketer relaxes the requirement of a direct transition between the buckets of two subsequent prefixes. Instead, the buckets (clusters) are determined by applying a clustering algorithm on the encoded prefix traces. This results in a number of clusters that do not exhibit any transitional structure. In other words, the buckets of $hd^k(\sigma)$ and $hd^{k+1}(\sigma)$ are determined independently from each other. Both of these prefixes might be assigned to the same cluster or different ones.

One classifier is trained per each resulting cluster, considering only the historical prefix traces that fall into that particular cluster. At run-time, the cluster of the running case is determined based on its similarity to each of the existing clusters and the respective classifier is applied.
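The run-time step can be sketched as follows, assuming that the prefixes are encoded as activity frequency vectors and that each cluster is summarized by a centroid compared with Euclidean distance. Both choices, as well as the names `frequency_vector` and `nearest_cluster`, are assumptions for illustration; the centroids are taken as given rather than computed by any particular clustering algorithm.

```python
import math
from collections import Counter


def frequency_vector(prefix, alphabet):
    """Encode a prefix as the number of occurrences of each activity in `alphabet`."""
    counts = Counter(prefix)
    return [counts[a] for a in alphabet]


def nearest_cluster(vector, centroids):
    """Return the index of the centroid closest to `vector` (Euclidean distance)."""
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances))


alphabet = ["A", "B", "C"]
centroids = [[1, 0, 0], [1, 2, 1]]  # hypothetical centroids from an earlier clustering step
bucket = nearest_cluster(frequency_vector(["A", "B", "B"], alphabet), centroids)
```

The classifier trained for the returned cluster index would then be applied to the running case.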

A clustering-based approach is proposed by Di Francescomarino et al. (Di Francescomarino et al., 2017). They experiment with two clustering methods, DBSCAN (with string-edit distance) and model-based clustering (with Euclidean distance on the frequencies of performed activities), and neither consistently outperforms the other. Another clustering-based method is introduced by Verenich et al. (Verenich et al., 2015). In their approach, the prefixes are encoded using index-based encoding (see Section 4.3.4) using both control flow and data payload, and then either hierarchical agglomerative clustering (HAC) or k-medoids clustering is applied. According to their results, k-medoids clustering consistently outperforms HAC.

4.2.5. Prefix length

In this approach, each bucket contains only the partial traces of a specific length. For example, one bucket contains traces where only the first event has been executed, another bucket contains those where the first and second events have been executed, and so on. One classifier is built for each possible prefix length. Prefix-length-based bucketing was proposed by Leontjeva et al. (Leontjeva et al., 2015). Also, Verenich et al. (Verenich et al., 2015) bucket the prefixes according to prefix length before applying a clustering method.
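A minimal sketch of prefix-length bucketing is given below. The optional `max_len` argument, which caps the prefix length, is an assumption added for illustration, as is the function name.

```python
from collections import defaultdict


def prefix_length_buckets(traces, max_len=None):
    """Map each prefix length k to the list of all prefixes of exactly k events."""
    buckets = defaultdict(list)
    for trace in traces:
        limit = len(trace) if max_len is None else min(len(trace), max_len)
        for k in range(1, limit + 1):
            buckets[k].append(tuple(trace[:k]))
    return dict(buckets)


log = [["A", "B", "C"], ["A", "C"]]
buckets = prefix_length_buckets(log)
```

Here the bucket for length 3 contains a single prefix, which illustrates how buckets for long prefixes can become small.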

4.2.6. Domain knowledge

While the bucketing methods described so far can detect buckets through an automatic procedure, it is possible to define a bucketing function that is based on manually constructed rules. In such an approach, the input from a domain expert is needed. The resulting buckets can, for instance, refer to context categories (Ghattas et al., 2014) or execution stages (Castellanos et al., 2005; Schwegmann et al., 2013a).

The aim of this survey and benchmark is to derive general principles by comparing methods that are applicable in arbitrary outcome-based predictive process monitoring scenarios and, thus, the methods that are based on domain knowledge about a particular dataset are left out of scope. For this reason, we do not further consider bucketing approaches based on domain knowledge.

4.3. Sequence encoding

In order to train a classifier, all prefix traces in the same bucket need to be represented as fixed-length feature vectors. The main challenge here comes from the fact that with each executed event, additional information about the case becomes available, while each trace in a bucket (independent of the number of executed events) should still be represented with the same number of features. This can be achieved by applying a trace abstraction technique (Van Der Aalst et al., 2010), for example, considering only the last $m$ events of a trace. However, choosing an appropriate abstraction is a difficult task, where one needs to balance the trade-off between generality and loss of information. (Generality in this context means being able to apply the abstraction technique to as many prefix traces as possible; as an example, the last $m$ states abstraction is not meaningful for prefixes that are shorter than $m$ events.) After a trace abstraction is chosen, a set of feature extraction functions may be applied to each event data attribute of the abstracted trace. Therefore, a sequence encoding method can be thought of as a combination of a trace abstraction technique and a set of feature extraction functions for each data attribute.

In the following subsections we describe the sequence encoding methods that have been used in the existing predictive process monitoring approaches. As described in Section 2, a trace can contain any number of static case attributes and dynamic event attributes. Both the case and the event attributes can be of numeric, categorical, or textual type. As none of the compared methods deal with textual data, hereinafter we will focus on numeric and categorical attributes only.

4.3.1. Static

The encoding of case attributes is rather straightforward. As they remain the same throughout the whole case, they can simply be added to the feature vector “as is” without any loss of information. In order to represent all the information as a numeric vector, we assume the “as is” representation of a categorical attribute to be one-hot encoding. This means that each value of a categorical attribute is transformed into a bitvector $(v_1, \dots, v_m)$, where $m$ is the number of possible levels of that attribute, $v_i = 1$ if the given value is equal to the $i$th level of the attribute, and $v_i = 0$ otherwise.
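The one-hot transformation can be sketched in a few lines of Python. The function name and the convention that an unseen value maps to an all-zero vector are assumptions for illustration.

```python
def one_hot(value, levels):
    """Return a bitvector with a 1 at the position of `value` in `levels`, 0 elsewhere."""
    return [1 if value == level else 0 for level in levels]


levels = ["low", "medium", "high"]  # hypothetical levels of a categorical case attribute
vector = one_hot("medium", levels)
```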

4.3.2. Last state

In this encoding method, only the last available snapshot of the data is used. Therefore, the size of the feature vector is proportional to the number of event attributes and is fixed throughout the execution of a case. A drawback of this approach is that it disregards all the information about what has happened in the past, using only the very latest data snapshot. To alleviate this problem, this encoding can easily be extended to the last $m$ states, in which case the size of the feature vector increases $m$ times. As the size of the feature vector does not depend on the length of the trace, the last state (or the last $m$ states) encoding can be used with buckets of traces of different lengths.

Using the last state abstraction, only one value (the last snapshot) of each data attribute is available. Therefore, no meaningful aggregation functions can be applied. Similarly to the static encoding, the numeric attributes are added to the feature vector “as is”, while one hot encoding is applied to each categorical attribute.

Last state encoding is the most common encoding technique, having been used in the KNN approach (Maggi et al., 2014), state-based bucketing (Lakshmanan et al., 2010), as well as the clustering-based bucketing approach by Di Francescomarino et al. (Di Francescomarino et al., 2017). Furthermore, De Leoni et al. (de Leoni et al., 2016) mention the possibility of using the last and the previous (the last two) states.
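The last state encoding can be sketched as follows, assuming each event is a dictionary of attribute values. The attribute names, the function name and the argument layout are hypothetical.

```python
def last_state_encode(prefix, numeric_attrs, categorical_levels):
    """Encode a prefix using only the attribute values of its last event."""
    last = prefix[-1]
    features = [float(last[attr]) for attr in numeric_attrs]  # numeric: as is
    for attr, levels in categorical_levels.items():           # categorical: one-hot
        features += [1.0 if last[attr] == lvl else 0.0 for lvl in levels]
    return features


prefix = [
    {"activity": "A", "amount": 5},
    {"activity": "B", "amount": 7},
]
vector = last_state_encode(prefix, ["amount"], {"activity": ["A", "B"]})
```

The length of the returned vector is the same for a prefix of one event as for a prefix of many events, which is why this encoding can be used with buckets of mixed prefix lengths.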

4.3.3. Aggregation

The last state encoding has obvious drawbacks in terms of information loss, neglecting all data that have been collected in the earlier stages of the trace. Another approach is to consider all events since the beginning of the case, but ignore the order of the events. This abstraction method paves the way to several aggregation functions that can be applied to the values that an event attribute has taken throughout the case.

In particular, the frequencies of performed activities (control flow) have been used in several existing works (Leontjeva et al., 2015; Di Francescomarino et al., 2017). Alternatively, boolean values have been used to express whether an activity has occurred in the trace. However, the frequency-based encoding has been shown to be superior to the boolean encoding (Leontjeva et al., 2015). For numerical attributes, De Leoni et al. (de Leoni et al., 2016) proposed using general statistics, such as the average, maximum, minimum, and sum.
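A minimal sketch of the aggregation encoding is shown below, using minimum, maximum, mean and sum for numeric attributes and frequencies for categorical ones. Events are again assumed to be dictionaries, and all names are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregation_encode(prefix, numeric_attrs, categorical_levels):
    """Encode a prefix by aggregating attribute values over all events, ignoring order."""
    features = []
    for attr in numeric_attrs:
        values = [float(event[attr]) for event in prefix]
        features += [min(values), max(values), mean(values), sum(values)]
    for attr, levels in categorical_levels.items():
        counts = Counter(event[attr] for event in prefix)
        features += [float(counts[lvl]) for lvl in levels]  # frequency of each level
    return features


prefix = [
    {"activity": "A", "amount": 5},
    {"activity": "B", "amount": 7},
    {"activity": "A", "amount": 9},
]
vector = aggregation_encode(prefix, ["amount"], {"activity": ["A", "B"]})
```

Replacing the frequency with `1.0 if counts[lvl] > 0 else 0.0` would give the boolean (occurrence) variant mentioned above.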

4.3.4. Index

While the aggregation encoding exploits information from all the performed events, it still exhibits information loss by neglecting the order of the events. The idea of index-based encoding is to use all possible information (including the order) in the trace, generating one feature for each event attribute and each executed event (each index). This way, a lossless encoding of the trace is achieved, which means that it is possible to completely recover the original trace based on its feature vector. A drawback of index-based encoding is that, because the length of the feature vector increases with each executed event, this encoding can only be used in homogeneous buckets where all traces have the same length.

Index-based encoding was proposed by Leontjeva et al. (Leontjeva et al., 2015). Additionally, in their work they combined the index-based encoding with HMM log-likelihood ratios. However, we decided not to experiment with HMMs in this study, mainly for two reasons. Firstly, the HMMs did not consistently improve the basic index-based encoding in (Leontjeva et al., 2015). Secondly, rather than being an essential part of index-based encoding, HMMs can be thought of as an aggregation function that can be applied to each event attribute, similarly to taking frequencies or numeric averages. Therefore, HMMs are not exclusive to index-based encoding, but could also be used in conjunction with the aggregation encoding. Index-based encoding is also used in the approach of Verenich et al. (Verenich et al., 2015).
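The basic index-based encoding (without HMM features) can be sketched as follows, again assuming events are dictionaries and using hypothetical names.

```python
def index_encode(prefix, numeric_attrs, categorical_levels):
    """Encode a prefix with one block of features per event, preserving event order."""
    features = []
    for event in prefix:
        features += [float(event[attr]) for attr in numeric_attrs]
        for attr, levels in categorical_levels.items():
            features += [1.0 if event[attr] == lvl else 0.0 for lvl in levels]
    return features


prefix = [
    {"activity": "A", "amount": 5},
    {"activity": "B", "amount": 7},
]
vector = index_encode(prefix, ["amount"], {"activity": ["A", "B"]})
```

The vector length grows linearly with the number of events, so all prefixes in a bucket must have the same length for this encoding to yield fixed-length vectors.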

Summary

An overview of the encoding methods can be seen in Table 4. Note that the static encoding extracts a different type of data from the trace (case attributes) than the other three methods (event attributes). Therefore, to obtain a complete representation of a trace, it is reasonable to concatenate the static encoding with one of the other three encodings. In our experiments, the static encoding is included in every method, e.g., the “last state” method in the experiments refers to the static encoding for case attributes concatenated with the last state encoding for event attributes.

| Encoding name | Relevant attributes | Trace abstraction | Feature extraction (numeric) | Feature extraction (categorical) |
|---|---|---|---|---|
| Static | Case | Case attributes | as is | one-hot |
| Last state | Event | Last event | as is | one-hot |
| Aggregation | Event | All events, unordered (set/bag) | min, max, mean, sum, std | frequencies or occurrences |
| Index | Event | All events, ordered (sequence) | as is for each index | one-hot for each index |

Table 4. Encoding methods

4.4. Classification algorithm

The existing predictive process monitoring methods have experimented with different classification algorithms. The most popular choice has been the decision tree (DT), which has obvious benefits in terms of the interpretability of the results. Another popular method has been random forest (Breiman, 2001) (RF), which usually achieves better prediction accuracy than a single decision tree, but is harder to interpret. Additionally, Leontjeva et al. (Leontjeva et al., 2015) experimented with support vector machines (SVM) and generalized boosted regression models (GBM), but found that their performance is inferior to RF. Recently, gradient boosted trees (Friedman, 2001) in conjunction with existing predictive process monitoring techniques have shown promising results, often outperforming RF (Rozumnyi, 2017; Senderovich et al., 2017).

4.5. Discussion

We have observed that the prefix filtering techniques are not inherent to any given predictive process monitoring method. Instead, these techniques are selected based on performance considerations and can be used in conjunction with any of the predictive process monitoring methods. In a similar vein, the choice of a classification algorithm is a general problem in machine learning and is not specific to business process data. In fact, all of the authors of the methods reviewed above claim that their method is applicable in conjunction with any classifier. Therefore, we consider neither the prefix filtering technique nor the classification algorithm employed to be relevant aspects when categorizing and comparing predictive process monitoring methods.

The considered methods also differ in terms of the event log attributes that are used for making predictions. However, it has been shown (Leontjeva et al., 2015) that including more information (i.e., combining control flow and data payload) can drastically increase the predictive power of the models. In order to provide a fair comparison of the different methods, it is preferable to provide the same set of attributes as input to all methods, and preferably the largest possible set of attributes. Accordingly, in the comparative evaluation below, we will encode traces using all the available case and event attributes (covering both control flow and data payload).

Based on the above, we conclude that existing outcome-oriented predictive process monitoring methods can be compared on two grounds:

  • how the prefix traces are divided into buckets (trace bucketing)?

  • how the (event) attributes are transformed into features (sequence encoding)?

Figure 4 provides a taxonomy of the relevant methods based on these two perspectives. Note that although the taxonomy is based on 7 primary studies, it contains 11 different approaches. The reason for this is that while the primary approaches tend to mix different encoding schemes, for example, use aggregation encoding for control flow and last state encoding for data payload (see Table 3), the taxonomy is constructed in a modular way, so that each encoding method constitutes a separate approach. In order to provide a fair comparison of different encoding schemes, we have decided to evaluate each encoding separately, while the same encoding is applied to both control flow and data payload. Still, the different encodings (that are valid for a given bucketing method) can be easily combined, if necessary. Similarly, the taxonomy does not contain combinations of several bucketing methods. An example of such “double bucket” approaches is the method by Verenich et al. (Verenich et al., 2015), where the prefixes are first divided into buckets based on prefix length and, in turn, clustering is applied in each bucket. We believe that comparing the performance of each bucketing method separately (rather than as a combination) provides more insights about the benefits of each method. Furthermore, the double bucket approaches divide the prefixes into many small buckets, which often leads to situations where a classifier receives too few training instances to learn meaningful patterns.

We note that the taxonomy generalizes the state-of-the-art, in the sense that even if a valid pair of bucketing and encoding method has not been used in any existing approach in the literature, it is included in the taxonomy (e.g., the state-based bucketing approach with aggregation encoding). We also note that while the taxonomy covers the techniques proposed in the literature, all these techniques rely on applying a propositional classifier on an explicit vectorial representation of the traces. One could envisage alternative approaches that do not require an explicit feature vector as input. For instance, kernel-based SVMs have been used in the related setting of predicting the cycle time of a case (van Dongen et al., 2008). Furthermore, one could envisage the use of data mining techniques to extract additional features from the traces (e.g., latent variables or frequent patterns). Although the taxonomy does not cover this aspect explicitly, applying such techniques is consistent with the taxonomy, since the derived features can be used in combination with any of the sequence encoding and bucketing approaches presented here. While most of the existing works on outcome-oriented predictive monitoring use the event/trace attributes “as-is” without an additional mining step, Leontjeva et al. used Hidden Markov Models for extracting additional features in combination with index-based encoding (Leontjeva et al., 2015). Similarly, Teinemaa et al. applied different natural language processing techniques to extract features from textual data (Teinemaa et al., 2016). Further on, although not yet applied to outcome-oriented predictive process monitoring tasks, different pattern mining techniques could be applied to extract useful patterns from the sequences, occurrences of which could then be used as features in the feature vectors of the traces. Such techniques have been used in the domain of early time series/sequence classification (Xing et al., 2008; Ghalwash et al., 2013; Ghalwash and Obradovic, 2012; He et al., 2015; Lin et al., 2015) and for predicting numerical measures (e.g., remaining time) for business processes (Folino et al., 2014).

Figure 4. Taxonomy of methods for predictive monitoring of business process outcome.

5. Benchmark

After conducting our survey, we proceeded with benchmarking the 11 approaches (shown in Figure 4) using different evaluation criteria (prediction accuracy, earliness and computation time), to address RQ3 (What is the relative performance of these methods?) – cf. Section 3.1.

To perform our benchmark, we implemented an open-source, tunable and extensible predictive process monitoring framework in Python (the code is available at https://github.com/irhete/predictive-monitoring-benchmark). All experiments were run using Python 3.6 and the scikit-learn library (Pedregosa et al., 2011) on a single core of an Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz with 64GB of RAM.

In the rest of this section we first introduce the evaluation datasets, then describe the evaluation procedure and conclude with discussing the results of the experiments.

5.1. Datasets

The benchmark is based on nine real-life event logs, out of which eight are publicly available and one is a private dataset. The public logs are accessible from the 4TU Centre for Research Data (https://data.4tu.nl/repository/collection:event_logs_real). The private log Insurance originates from a claims handling process at an Australian insurance company. The criterion for selecting the public event logs for the evaluation was that the log must contain both case attributes (static) and event attributes (dynamic). Based on this, we discarded the logs from the years 2013-2014. We also discarded the BPIC 2016 dataset because it is a click-dataset of a Web service, rather than an event log of a business process.

For some logs, we applied several labeling functions. In other words, the outcome of a case is defined in several ways depending on the goals and needs of the process owner. Each such notion of the outcome constitutes a separate predictive process monitoring task with a slightly different input dataset. In total, we formulated 24 different outcome prediction tasks based on the nine original event logs. In the following paragraphs we describe the original logs, the applied labeling functions, and the resulting predictive monitoring tasks in more detail.

BPIC2011. This event log contains cases from the Gynaecology department of a Dutch Academic Hospital. Each case assembles the medical history of a given patient, where the applied procedures and treatments are recorded as activities. Similarly to previous work (Leontjeva et al., 2015; Di Francescomarino et al., 2017), we use four different labeling functions based on LTL rules (Pnueli, 1977). Specifically, we define the class label for a case according to whether an LTL rule $\varphi$ is violated or satisfied by the corresponding trace $\sigma$.

Table 5 introduces the semantics of the LTL operators.

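| operator | semantics |
|---|---|
| $\mathbf{X}\varphi$ | $\varphi$ has to hold in the next position of a path. |
| $\mathbf{G}\varphi$ | $\varphi$ has to hold always in the subsequent positions of a path. |
| $\mathbf{F}\varphi$ | $\varphi$ has to hold eventually (somewhere) in the subsequent positions of a path. |
| $\varphi \, \mathbf{U} \, \psi$ | $\varphi$ has to hold in a path at least until $\psi$ holds. $\psi$ must hold in the current or in a future position. |

Table 5. LTL Operators Semantics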

The four LTL rules used to formulate the four prediction tasks on the BPIC 2011 log are as follows:

  • bpic2011_1: ,

  • bpic2011_2: ,

  • bpic2011_3: , and

  • bpic2011_4: .

For example, the LTL rule $\varphi$ for bpic2011_1 expresses that at least one of the activities “tumor marker CA-19.9” or “ca-125 using meia” must happen eventually during a case. Evidently, the class label of a case becomes known and irreversible when one of these two events has been executed. In order to avoid bias introduced by this phenomenon during the evaluation phase, all the cases are cut exactly before either of these events happens. Similarly, the cases are cut before the occurrence of “histological examination-biopsies nno” in the bpic2011_3 dataset and before “histological examination-big resectiep” in bpic2011_4. However, no cutting is performed in the bpic2011_2 dataset, because its rule $\varphi$ states that a “CEA-tumor marker using meia” event must always be followed by a “squamous cell carcinoma using eia” event sometime in the future. Therefore, even if one occurrence of “CEA-tumor marker using meia” has successfully been followed by a “squamous cell carcinoma using eia” ($\varphi$ is satisfied), another occurrence of “CEA-tumor marker using meia” will cause $\varphi$ to be violated again and, thus, the class label is not irreversibly known until the case completes.
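The two rule shapes described in words above, and the cutting of cases before a label-revealing activity, can be sketched over finite traces as follows. The function names are hypothetical, the check for the second rule assumes that the trigger and response activities are distinct, and the mapping from satisfaction or violation to a class label is deliberately left open.

```python
def eventually_any(trace, activities):
    """True if at least one of `activities` occurs somewhere in the trace."""
    return any(act in activities for act in trace)


def always_followed_by(trace, trigger, response):
    """True if every occurrence of `trigger` is later followed by `response`."""
    pending = False
    for act in trace:
        if act == trigger:
            pending = True       # a new obligation is opened
        elif act == response:
            pending = False      # all open obligations are discharged
    return not pending


def cut_before(trace, activities):
    """Keep only the events strictly before the first occurrence of any of `activities`."""
    for i, act in enumerate(trace):
        if act in activities:
            return list(trace[:i])
    return list(trace)


trace = ["registration", "CEA-tumor marker using meia",
         "squamous cell carcinoma using eia", "CEA-tumor marker using meia"]
satisfied = always_followed_by(trace, "CEA-tumor marker using meia",
                               "squamous cell carcinoma using eia")
```

The example trace shows why no cutting is possible for the second rule shape: the rule is satisfied after the third event but violated again after the fourth.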

BPIC2015. This dataset assembles event logs from 5 Dutch municipalities, pertaining to the building permit application process. We treat the datasets from each municipality as separate event logs and apply a single labeling function to each one. Similarly to BPIC 2011, the labeling function is based on the satisfaction/violation of an LTL rule $\varphi$. The prediction tasks for each of the 5 municipalities are denoted as bpic2015_i, where $i$ indicates the number of the municipality. The LTL rule used in the labeling functions is as follows:

  • bpic2015_i: .

No trace cutting can be performed here, because, similarly to bpic2011_2, the final satisfaction/violation of $\varphi$ is not known until the case completes.

Production. This log contains data from a manufacturing process. Each trace records information about the activities, workers and/or machines involved in producing an item. The labeling (production) is based on whether or not the number of rejected work orders is larger than zero.

Insurance. This is the only private log we use in the experiments. It comprises cases from an Australian insurance claims handling process. We apply two labeling functions:

  • insurance_1: is based on whether a specific “key” activity is performed during the case or not.

  • insurance_2: is based on the time taken for handling the case, dividing them into slow and fast cases.

Sepsis cases. This log records trajectories of patients with symptoms of the life-threatening sepsis condition in a Dutch hospital. Each case logs events since the patient’s registration in the emergency room until her discharge from the hospital. Among others, laboratory tests together with their results are recorded as events. Moreover, the reason of the discharge is available in the data in an obfuscated format. We created three different labelings for this log:

  • sepsis_1: the patient returns to the emergency room within 28 days from the discharge,

  • sepsis_2: the patient is (eventually) admitted to intensive care,

  • sepsis_3: the patient is discharged from the hospital on the basis of something other than Release A (i.e., the most common release type).

BPIC2012. This dataset, originally published in relation to the Business Process Intelligence Challenge (BPIC) in 2012, contains the execution history of a loan application process in a Dutch financial institution. Each case in this log records the events related to a particular loan application. For classification purposes, we defined some labelings based on the final outcome of a case, i.e., whether the application is accepted, rejected, or canceled. Intuitively, this could be thought of as a multi-class classification problem. However, to remain consistent with previous work on outcome-oriented predictive process monitoring, we approach it as three separate binary classification tasks. In the experiments, these tasks are referred to as bpic2012_1, bpic2012_2, and bpic2012_3.

BPIC2017. This event log originates from the same financial institution as the BPIC2012 one. However, the data collection has been improved, resulting in a richer and cleaner dataset. As in the previous case, the event log records execution traces of a loan application process. Similarly to BPIC2012, we define three separate labelings based on the outcome of the application, referred to as bpic2017_1, bpic2017_2, and bpic2017_3.

Hospital billing. This dataset comes from an ERP system of a hospital. Each case is an execution of a billing procedure for medical services. We created two labelings for this log:

  • hospital_1: the billing package was not eventually closed,

  • hospital_2: the case is reopened.

Traffic fines. This log comes from an Italian local police force. The dataset contains events about notifications sent about a fine, as well as (partial) repayments. Additional information related to the case and to the individual events include, for instance, the reason, the total amount, and the amount of repayments for each fine. We created the labeling (traffic) based on whether the fine is repaid in full or is sent for credit collection.

The resulting 24 datasets exhibit different characteristics, which can be seen in Table 6. The smallest log is production, which contains 220 cases, while the largest one is traffic with 129615 cases. The most heterogeneous in terms of case length are the bpic2011 labelled datasets, where the longest case consists of 1814 events. On the other hand, the most homogeneous is the traffic log, where case length varies from 2 to 20 events. The class labels are the most imbalanced in the hospital_2 dataset, where only 5% of cases are labeled as positive ones (class label = 1). Conversely, in bpic2012_1, bpic2017_3, and traffic, the classes are almost balanced. In terms of event classes, the most homogeneous are the insurance datasets, with only 9 distinct activity classes. The most heterogeneous are the bpic2015 datasets, reaching 396 event classes in bpic2015_2. The datasets also differ in terms of the number of static and dynamic attributes. The insurance logs contain the most dynamic attributes (22), while the sepsis datasets contain the largest number of static attributes (24).

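| dataset | # traces | min length | med length | max length | trunc length | # variants (after trunc) | pos class ratio | # event classes | # static attr-s | # dynamic attr-s | # static cat levels | # dynamic cat levels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bpic2011_1 | 1140 | 1 | 25.0 | 1814 | 36 | 176 | 0.4 | 193 | 6 | 14 | 961 | 290 |
| bpic2011_2 | 1140 | 1 | 54.5 | 1814 | 40 | 218 | 0.78 | 251 | 6 | 14 | 994 | 370 |
| bpic2011_3 | 1121 | 1 | 21.0 | 1368 | 31 | 167 | 0.23 | 190 | 6 | 14 | 886 | 283 |
| bpic2011_4 | 1140 | 1 | 44.0 | 1432 | 40 | 205 | 0.28 | 231 | 6 | 14 | 993 | 338 |
| bpic2015_1 | 696 | 2 | 42.0 | 101 | 40 | 297 | 0.23 | 380 | 17 | 12 | 19 | 433 |
| bpic2015_2 | 753 | 1 | 55.0 | 132 | 40 | 329 | 0.19 | 396 | 17 | 12 | 7 | 429 |
| bpic2015_3 | 1328 | 3 | 42.0 | 124 | 40 | 334 | 0.2 | 380 | 18 | 12 | 18 | 428 |
| bpic2015_4 | 577 | 1 | 42.0 | 82 | 40 | 276 | 0.16 | 319 | 15 | 12 | 9 | 347 |
| bpic2015_5 | 1051 | 5 | 50.0 | 134 | 40 | 288 | 0.31 | 376 | 18 | 12 | 8 | 420 |
| production | 220 | 1 | 9.0 | 78 | 23 | 26 | 0.53 | 26 | 3 | 15 | 37 | 79 |
| insurance_1 | 1065 | 6 | 12.0 | 100 | 8 | 9 | 0.16 | 9 | 0 | 22 | 0 | 207 |
| insurance_2 | 1065 | 6 | 12.0 | 100 | 13 | 9 | 0.26 | 9 | 0 | 22 | 0 | 207 |
| sepsis_1 | 782 | 5 | 14.0 | 185 | 29 | 15 | 0.14 | 15 | 24 | 13 | 200 | 39 |
| sepsis_2 | 782 | 4 | 13.0 | 60 | 13 | 15 | 0.14 | 15 | 24 | 13 | 200 | 40 |
| sepsis_3 | 782 | 4 | 13.0 | 185 | 22 | 15 | 0.86 | 15 | 24 | 13 | 200 | 40 |
| bpic2012_1 | 4685 | 15 | 35.0 | 175 | 40 | 36 | 0.48 | 36 | 1 | 10 | 0 | 99 |
| bpic2012_2 | 4685 | 15 | 35.0 | 175 | 40 | 36 | 0.17 | 36 | 1 | 10 | 0 | 99 |
| bpic2012_3 | 4685 | 15 | 35.0 | 175 | 40 | 36 | 0.35 | 36 | 1 | 10 | 0 | 99 |
| bpic2017_1 | 31413 | 10 | 35.0 | 180 | 20 | 25 | 0.41 | 26 | 3 | 20 | 13 | 194 |
| bpic2017_2 | 31413 | 10 | 35.0 | 180 | 20 | 25 | 0.12 | 26 | 3 | 20 | 13 | 194 |
| bpic2017_3 | 31413 | 10 | 35.0 | 180 | 20 | 25 | 0.47 | 26 | 3 | 20 | 13 | 194 |
| traffic | 129615 | 2 | 4.0 | 20 | 10 | 10 | 0.46 | 10 | 4 | 14 | 54 | 173 |
| hospital_1 | 77525 | 2 | 6.0 | 217 | 6 | 18 | 0.1 | 18 | 1 | 21 | 23 | 1756 |
| hospital_2 | 77525 | 2 | 6.0 | 217 | 8 | 17 | 0.05 | 17 | 1 | 21 | 23 | 1755 |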

Table 6. Statistics of the datasets used in the experiments.

While most of the data attributes can be readily included in the train and test datasets, timestamps should be preprocessed in order to derive meaningful features. In our experiments, we use the following features extracted from the timestamp: month, weekday, hour, duration from the previous event in the given case, duration from the start of the case, and the position of the event in the case. Additionally, some recent works have shown that adding features extracted from collections of cases (inter-case features) increases the accuracy of the predictive models, particularly when predicting deadline violations (Conforti et al., 2015; Senderovich et al., 2017). For example, waiting times are highly dependent on the number of ongoing cases of a process (the so-called “Work-In-Process”). In turn, waiting times may affect the outcome of a case, particularly if the outcome is defined with respect to a deadline or with respect to customer satisfaction. Accordingly, we extract an inter-case feature reflecting the number of cases that are “open” at the time of executing a given event. All the abovementioned features are used as numeric dynamic (event) attributes.
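The timestamp-derived features and the inter-case feature listed above can be sketched as follows. Measuring durations in seconds, counting positions from 1, and representing each case by a (start, end) interval are assumptions for illustration, as are the function names.

```python
from datetime import datetime


def timestamp_features(timestamps):
    """Derive per-event features from the chronologically ordered timestamps of one case."""
    rows = []
    for i, ts in enumerate(timestamps):
        prev = timestamps[i - 1] if i > 0 else ts
        rows.append({
            "month": ts.month,
            "weekday": ts.weekday(),  # Monday = 0
            "hour": ts.hour,
            "since_previous": (ts - prev).total_seconds(),
            "since_start": (ts - timestamps[0]).total_seconds(),
            "position": i + 1,
        })
    return rows


def open_cases(case_intervals, at):
    """Count the cases whose (start, end) interval contains the time `at`."""
    return sum(1 for start, end in case_intervals if start <= at <= end)


case = [datetime(2017, 7, 21, 9, 0), datetime(2017, 7, 21, 10, 30)]
features = timestamp_features(case)
```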

Each of the categorical attributes has a fixed number of possible values, called levels. For some attributes, the number of distinct levels can be very large, with some of the levels appearing in only a few cases. In order to avoid exploding the dimensionality of the input dataset, we retain only the category levels that appear in at least 10 samples. This filtering is applied to each categorical attribute except the event class (activity), where we use all category levels.
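The level filtering can be sketched as follows. The function name and the `min_count` parameter are hypothetical; how the remaining rare levels are represented afterwards is not specified here and is left out of the sketch.

```python
from collections import Counter


def frequent_levels(values, min_count=10):
    """Return the set of levels that occur in at least `min_count` samples."""
    counts = Counter(values)
    return {level for level, n in counts.items() if n >= min_count}


# Hypothetical column of a categorical attribute: one frequent and one rare level.
column = ["clerk"] * 12 + ["manager"] * 3
kept = frequent_levels(column)
```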

5.2. Experimental set-up

In this subsection, we start with describing the employed evaluation measures. We then proceed with describing our approach to splitting the event logs into train and test datasets and optimizing the hyperparameters of the compared methods.

5.2.1. Research questions and evaluation measures

In a predictive process monitoring use case, the quality of the predictions is typically measured with respect to two main dimensions based on the following desiderata: A good prediction should be accurate and it should be made in the early stages of the process. A prediction that is often inaccurate is a useless prediction, as it cannot be relied on when making decisions. Therefore, accuracy is, in a sense, the most important quality of a prediction. The earlier an accurate prediction is made, the more useful it is in practice, as it leaves more time to act upon the prediction. Based on this rationale, we formulate the first subquestion (RQ3.1) as follows: How do the existing outcome-oriented predictive business process monitoring techniques compare in terms of accuracy and earliness of the predictions?

Different metrics can be used to measure the accuracy of predictions. Rather than returning a hard prediction (a binary number) on the expected case outcome, the classifiers usually output a real-valued score, reflecting how likely it is that the case will end in one way or the other. A good classifier will give higher scores to cases that will end with a positive outcome, and lower values to those ending with a negative one. Based on this intuition, we use the area under the ROC curve (AUC) metric that expresses the probability that a given classifier will rank a positive case higher than a negative one. A major advantage of the AUC metric over the commonly used accuracy (the proportion of correctly classified instances) or F-score (the harmonic mean of precision and recall) is that the AUC remains unbiased even in case of a highly imbalanced distribution of class labels (Bradley, 1997).
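The ranking interpretation of the AUC can be made concrete with a small pairwise computation. This is a sketch for illustration rather than an efficient implementation: it compares every positive case with every negative case and counts ties as half a win, which is the usual convention.

```python
def auc(labels, scores):
    """Probability that a randomly chosen positive case is scored above a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ranked correctly.
value = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

Because only the relative order of positive and negative cases matters, the value does not change when the proportion of positive cases changes.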

From the literature, two different approaches emerge for measuring the earliness of the predictions. One way (Leontjeva et al., 2015) is to evaluate the models separately for each prefix length. In each step, the prediction model is applied to a subset of prefixes of exactly the given length. The improvement of prediction accuracy as the prefix length increases provides an implicit notion of earliness. In particular, the smaller the prefix length when an acceptable level of accuracy is reached, the better the method in terms of earliness. If needed, earliness can be defined explicitly as a metric: the smallest prefix length where the model achieves a specified accuracy threshold. Another option is to keep monitoring each case until the classifier gives an outcome prediction with a sufficiently high confidence, and then measure the earliness as the average prefix length when such a prediction is made (Di Francescomarino et al., 2017; Teinemaa et al., 2016). The latter approach is mostly relevant in failure prediction scenarios, where the purpose is to raise an alarm when the estimated risk becomes higher than a pre-specified threshold (for a survey of failure prediction models, see (Salfner et al., 2010)). However, even when the predictions come with a high confidence score, they might not necessarily be accurate. In the benchmark, we employ the first approach to measuring earliness, evaluating the models on each prefix length, because it provides a more straightforward representation of earliness that is relevant for all business processes. Also, it ensures that the “early predictions” have reached a suitable level of accuracy.
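The explicit earliness metric mentioned above (the smallest prefix length at which a specified accuracy threshold is reached) can be sketched as follows. The input is assumed to be a dictionary from prefix length to the AUC measured on prefixes of that length; the function name and the example values are hypothetical.

```python
def earliness(auc_by_prefix_length, threshold):
    """Return the smallest prefix length whose AUC reaches `threshold`, or None."""
    for length in sorted(auc_by_prefix_length):
        if auc_by_prefix_length[length] >= threshold:
            return length
    return None


# Hypothetical per-prefix-length AUC values, not taken from the benchmark results.
measured = {1: 0.60, 2: 0.72, 3: 0.80}
first_acceptable = earliness(measured, threshold=0.7)
```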

In order to be applicable in practice a prediction should be produced efficiently, i.e., the execution times should be suitable for a given application. To address this, we formulate the second subquestion (RQ3.2) as follows: How do the existing outcome-oriented predictive business process monitoring techniques compare in terms of execution times? When measuring the execution times of the methods, we distinguish the time taken in the offline and the online modes. The offline time is the total time needed to construct the classifier from the historic traces available in an event log. Namely, it includes the time for constructing the prefix log, bucketing and encoding the prefix traces, and training the classifier. In the online phase, it is essential that a prediction is produced almost instantaneously, as the predictions are usually needed in real time. Accordingly, we define the online time as the average time for processing one incoming event (incl. bucketing, encoding, and predicting based on this new event).

The execution times are affected mainly by two factors. Firstly, since each prefix of a trace constitutes one sample, the lengths of the traces have a direct effect on the number of (training) samples. It is natural that the more samples are used for training, the better accuracy the predictive monitoring system could yield. At the same time, using more samples increases the execution times of the system. In applications where the efficiency of the predictions is of critical importance, reducing the number of training samples can yield a reasonable tradeoff, bringing down the execution times to a suitable level, while accepting lower accuracy. One way to reduce the number of samples is gap-based filtering (Di Francescomarino et al., 2017), where a prefix is added to the training set only after every fixed number of events (the gap) in the trace. This leads us to the third subquestion (RQ3.3): To what extent does gap-based filtering improve the execution times of the predictions?

The second factor that affects the execution times is the number and the diversity of attributes that need to be processed. In particular, the number of unique values (levels) in the categorical attribute domains has a direct effect on the length of the feature vector constructed for each sample, since each level corresponds to a feature in the vector (this holds for one hot encoding, as well as using occurrences or frequencies). The dimensionality of the vector can be controlled by filtering of the levels, for instance, by using only the most frequent levels for each categorical attribute. However, such filtering may negatively impact the accuracy of the predictions. In the fourth subquestion (RQ3.4), we aim to answer the following: To what extent does filtering the levels of categorical attributes based on their frequencies improve the execution times of the predictions?
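As an illustration of frequency-based level filtering, the sketch below keeps only a proportion of the most frequent levels of one categorical attribute and one-hot encodes a value against the retained levels. The helper names are hypothetical, and mapping filtered-out levels to an all-zero vector is an assumption of this sketch rather than something stated in the text.

```python
from collections import Counter
import math


def top_levels(values, proportion):
    """Most frequent levels of a categorical attribute, keeping the given
    proportion of the distinct levels (at least one)."""
    counts = Counter(values)
    keep = max(1, math.ceil(proportion * len(counts)))
    return {level for level, _ in counts.most_common(keep)}


def one_hot(value, levels):
    """One feature per retained level; a filtered-out level yields all zeros
    (an assumption of this sketch)."""
    return [int(value == level) for level in sorted(levels)]


# Hypothetical attribute values observed in a training log.
observed = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
kept = top_levels(observed, proportion=0.5)

print(sorted(kept))
print(one_hot("b", kept))  # a retained level
print(one_hot("d", kept))  # a rare level that was filtered out
```

Halving the number of retained levels halves the number of features contributed by this attribute, at the cost of no longer distinguishing the rare levels.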

5.2.2. Train-test split

In order to simulate the real-life situation where prediction models are trained using historic data and applied to ongoing cases, we employ a temporal split to divide the event log into train and test cases. Namely, the cases are ordered according to the start time and the first 80% are used for training the models, while the remaining 20% are used to evaluate the performance. In other words, the classifier is trained with all cases that started before a given date, and the testing is done only on cases that start afterwards. Note that, using this approach, some events in the training cases could still overlap with the test period. In order to avoid that, we cut the training cases so that events that overlap with the test period are discarded.
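The temporal split, including the truncation of training cases, can be sketched as below. The data layout (a dictionary from case identifier to a time-ordered list of `(timestamp, activity)` pairs) and the function name are assumptions made for illustration.

```python
def temporal_split(cases, train_ratio=0.8):
    """Split cases by start time and cut training events that overlap
    with the test period.

    cases: dict mapping case id -> list of (timestamp, activity), sorted by time.
    Returns (train, test) dictionaries with the same layout.
    """
    # Order the cases by the timestamp of their first event.
    ordered = sorted(cases, key=lambda cid: cases[cid][0][0])
    n_train = int(len(ordered) * train_ratio)
    train_ids, test_ids = ordered[:n_train], ordered[n_train:]
    test = {cid: cases[cid] for cid in test_ids}
    if not test_ids:
        return {cid: cases[cid] for cid in train_ids}, test
    # The test period starts when the first test case starts.
    split_time = cases[test_ids[0]][0][0]
    # Discard training events that happen during the test period.
    train = {cid: [e for e in cases[cid] if e[0] < split_time]
             for cid in train_ids}
    return train, test


# Hypothetical log with integer timestamps.
log = {
    "c1": [(1, "A"), (4, "B")],
    "c2": [(2, "A"), (9, "B")],
    "c3": [(3, "A"), (5, "B")],
    "c4": [(6, "A"), (12, "C")],
    "c5": [(8, "A"), (10, "B")],
}
train, test = temporal_split(log)
print(train)
print(test)
```

In this toy log, cases c2 and c4 start before the test period but continue into it, so their later events are dropped from the training set.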

5.2.3. Classifier learning and bucketing parameters

We selected four classification algorithms for the experiments: random forest (RF), gradient boosted trees (XGBoost), logistic regression (logit), and support vector machines (SVM). We chose logistic regression because of its simplicity and wide application in various machine learning applications. SVM and RF have been used in existing outcome-oriented predictive monitoring studies, and RF has been shown to outperform many other methods (such as decision trees) both in predictive monitoring scenarios (Leontjeva et al., 2015) and in more general empirical studies (Fernández-Delgado et al., 2014). We also included the XGBoost classifier, which has recently gained attention and shown promising results when applied to business process data (Rozumnyi, 2017; Senderovich et al., 2017). Furthermore, a recent empirical study on the performance of classification algorithms across 165 datasets has shown that RF and boosted trees generally outperform other classifier learning techniques (Olson et al., 2017). For the clustering-based bucketing approach (cf. Section 4.2.4), we use the k-means clustering algorithm, which is one of the most widely used clustering methods in general.

The classification algorithms, as well as some of the bucketing methods (clustering and KNN), require one to specify a number of parameters. In order to achieve good performance with each of the techniques, we optimize the hyperparameters using the Tree-structured Parzen Estimator (TPE) algorithm (Bergstra et al., 2011), separately for each combination of a dataset, a bucketing method, and a sequence encoding method. For each combination of parameter values (i.e., a configuration) we performed 3-fold cross validation within the whole set of prefix traces extracted from the training set, and we selected the configuration that led to the highest mean AUC calculated across the three folds. In the case of the prefix length based bucketing method, an optimal configuration was chosen for each prefix length separately (i.e., for each combination of a dataset, a bucketing method, an encoding approach and a prefix length). Table 7 presents the bounds and the sampling distributions for each of the parameters, given as input to the optimizer. In the case of RF and XGBoost, we found via exploratory testing that the results are almost unaffected by the number of estimators (i.e., trees) trained per model. Therefore, we use a fixed number of estimators throughout the experiments.

| Classifier | Parameter | Distribution | Values |
|---|---|---|---|
| RF | Max features | Uniform | |
| XGBoost | Learning rate | Uniform | |
| | Subsample | Uniform | |
| | Max tree depth | Uniform integer | |
| | Colsample bytree | Uniform | |
| | Min child weight | Uniform integer | |
| Logit | Inverse of regularization strength (C) | Uniform integer | |
| SVM | Penalty parameter of the error term (C) | Uniform integer | |
| | Kernel coefficient (gamma) | Uniform integer | |
| K-means | Number of clusters | Uniform integer | |
| KNN | Number of neighbors | Uniform integer | |
Table 7. Hyperparameters and distributions used in optimization via TPE.

Both k-means and KNN require us to map each trace prefix into a feature vector in order to compute the Euclidean distance between pairs of prefixes. To this end, we applied the aggregation encoding approach, meaning that we map each trace to a vector that tells us how many times each possible activity appears in the trace. In order to keep consistent with the original methods, we decided to use only the control flow information for the clustering and the determining of the nearest neighbors.
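A minimal sketch of this control-flow encoding and of the Euclidean distance computed on it is given below. The function names and the toy prefixes are illustrative assumptions; the benchmark itself may organize this differently.

```python
from collections import Counter
import math


def activity_counts(prefix, activities):
    """Aggregation encoding on control flow only: how many times each
    possible activity appears in the prefix."""
    counts = Counter(prefix)
    return [counts[a] for a in activities]


def euclidean(u, v):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbors(query, historical, activities, k):
    """The k historical prefixes closest to the query prefix."""
    q = activity_counts(query, activities)
    return sorted(
        historical,
        key=lambda p: euclidean(q, activity_counts(p, activities)),
    )[:k]


# Hypothetical activity alphabet and prefixes.
activities = ["A", "B", "C"]
history = [["A", "B"], ["C", "C", "C"], ["A", "A", "B"]]

print(activity_counts(["A", "B", "A"], activities))
print(nearest_neighbors(["A", "B", "A"], history, activities, k=1))
```

Because the encoding only counts occurrences, the prefixes A, B, A and A, A, B are at distance zero from each other: the order of activities is not taken into account.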

In the case of the state-based bucketing, we need to specify a function that maps each trace prefix to a state. To this end, we used the last-activity encoding, meaning that one state is defined per possible activity and a trace prefix is mapped to the state corresponding to the last activity in the prefix. Note that the number of buckets produced by this approach is equal to the number of unique activities in the dataset (see Table 6). We made this choice because it leads to reasonably large buckets. We also experimented with the multiset state abstraction approach, but it led to too many buckets, some of small size, so that in general there were not enough samples per bucket to train a classifier with sufficient accuracy.

When using a state-based or a clustering-based bucketing method, it may happen that a given bucket contains too few trace prefixes to train a meaningful classifier. Accordingly, we set a minimum bucket size threshold. If the number of trace prefixes in a bucket is less than the threshold, we do not build a classifier for that bucket but instead, any trace prefix falling in that bucket is mapped to the label (i.e., the outcome) that is predominant in that bucket, with a likelihood score equal to the ratio of trace prefixes in the bucket that have the predominant label. To be consistent with the choice of the parameter K in the KNN approach proposed in (Maggi et al., 2014), we fixed the minimum bucket size threshold to 30. Similarly, when all of the training instances in a bucket belong to the same class, no classifier is trained for this bucket and, instead, the test instances falling to this bucket are simply assigned the same class (i.e., the assigned prediction score is either 0 or 1).
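The last-activity state function and the fallback rule for small or single-class buckets can be sketched as follows. The threshold of 30 comes from the text; the function names, the data layout (pairs of prefix and binary label), and the convention that label 1 is the positive class are assumptions of this sketch.

```python
from collections import Counter, defaultdict

MIN_BUCKET_SIZE = 30  # minimum bucket size threshold stated in the text


def bucket_by_last_activity(prefixes):
    """Group (prefix, label) pairs by the last activity of the prefix."""
    buckets = defaultdict(list)
    for prefix, label in prefixes:
        buckets[prefix[-1]].append((prefix, label))
    return buckets


def fallback_prediction(bucket, positive_label=1):
    """Return (label, score) if no classifier should be trained for the
    bucket, or None if the bucket is large and mixed enough to train one."""
    labels = Counter(label for _, label in bucket)
    majority, count = labels.most_common(1)[0]
    if len(labels) == 1:
        # Single-class bucket: the prediction score is either 0 or 1.
        return majority, 1.0 if majority == positive_label else 0.0
    if len(bucket) < MIN_BUCKET_SIZE:
        # Small bucket: predominant label, scored by its share in the bucket.
        return majority, count / len(bucket)
    return None


# Hypothetical buckets.
small = [(["A"], 1), (["B", "A"], 1), (["A"], 1), (["C", "A"], 0)]
pure = [(["A"], 0)] * 50
large_mixed = [(["A"], 1)] * 20 + [(["A"], 0)] * 20

print(fallback_prediction(small))
print(fallback_prediction(pure))
print(fallback_prediction(large_mixed))
```

Only the last bucket would be passed on to classifier training; the other two receive the constant predictions described above.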

In case of logit and SVM, the features are standardized by subtracting the mean and scaling to unit variance before being given as input to the classifier.
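For illustration, standardization can be sketched as below. Estimating the statistics on the training data only, using the population standard deviation, and leaving constant features unscaled are assumptions of this sketch; the function names are hypothetical.

```python
import statistics


def fit_standardizer(rows):
    """Per-feature mean and standard deviation, estimated on training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # A constant feature has zero deviation; divide by 1 to avoid 0/0.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def standardize(row, means, stds):
    """Subtract the mean and scale to unit variance, feature by feature."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


# Hypothetical training matrix with two features.
train_rows = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_standardizer(train_rows)

print(standardize([3.0, 10.0], means, stds))
```

Tree-based classifiers such as RF and XGBoost are insensitive to monotone rescaling of individual features, which is consistent with the text applying this step only to logit and SVM.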

5.2.4. Filtering and feature encoding parameters

As discussed in Section 4.1, training a classifier over the entire prefix log (all prefixes of all traces) can be time-consuming. Furthermore, we are only interested in making predictions for earlier events rather than making predictions towards the end of a trace. Additionally, we observe that the distributions of the lengths of the traces can be different within the classes corresponding to different outcomes (see Figures 13-14 in Appendix). When all instances of long prefixes belong to the same class, predicting the outcome for these (or longer) prefixes becomes trivial. Accordingly, during both the training and the evaluating phases, we vary the prefix length from 1 to the point where 90% of the minority class have finished (or until the end of the given trace, if it ends earlier than this point). For computational reasons, we set the upper limit of the prefix lengths to 40, except for the bpic2017 datasets where we further reduced the limit to 20. We argue that setting a limit to the maximum prefix length is a reasonable design choice, as the aim of predictive process monitoring is to predict as early as possible and, therefore, we are more interested in predictions made for shorter prefixes. When answering RQ3.3, we additionally apply the gap-based filtering to the training set. For instance, with a gap of 5, only prefixes of lengths 1, 6, 11, 16, 21, 26, 31, and 36 are included in the training set.
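The prefix lengths retained by gap-based filtering follow a simple arithmetic progression, as the sketch below shows for the example given above (a step of 5 with the upper limit of 40). The function names are hypothetical.

```python
def gap_filtered_lengths(max_prefix_length, gap):
    """Prefix lengths kept in the training set: 1, 1 + gap, 1 + 2 * gap, ..."""
    return list(range(1, max_prefix_length + 1, gap))


def training_prefixes(trace, max_prefix_length, gap=1):
    """Prefixes of a trace (a list of events) at the retained lengths.
    A gap of 1 keeps every prefix, i.e. no filtering."""
    limit = min(len(trace), max_prefix_length)
    return [trace[:n] for n in gap_filtered_lengths(limit, gap)]


print(gap_filtered_lengths(40, 5))

# Hypothetical trace of 7 events.
trace = ["A", "B", "C", "D", "E", "F", "G"]
print(training_prefixes(trace, max_prefix_length=40, gap=5))
```

With a step of 5 the number of training samples drawn from a long trace drops to roughly a fifth, which is where the savings in offline time come from.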

In Section 4.3.3, we noted that the aggregation encoding requires us to specify an aggregation function for each event attribute. For activities and resource attributes we use the count (frequency) aggregation function (i.e., how many times a given activity has been executed, or how many activities a given resource has executed). The same principle is applied to any other event attribute of a categorical type. For each numeric event attribute, we include two numeric features in the feature vector: the mean and the standard deviation. Furthermore, to answer RQ3.4, we filter each of the categorical attribute domains by keeping only a top percentage of the most frequent levels of each attribute.
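The aggregation functions described above can be sketched as follows: counts for categorical event attributes, mean and standard deviation for numeric ones. The event layout (one dictionary per event), the attribute names, and the use of the population standard deviation are assumptions of this sketch.

```python
from collections import Counter
import statistics


def aggregate_prefix(events, categorical, numeric, levels):
    """Aggregation encoding of a prefix given as a list of event dicts.

    levels maps each categorical attribute to its list of known levels.
    """
    features = {}
    for attr in categorical:
        counts = Counter(e[attr] for e in events)
        for level in levels[attr]:
            # Count aggregation: occurrences of this level in the prefix.
            features[f"{attr}={level}"] = counts[level]
    for attr in numeric:
        values = [e[attr] for e in events]
        features[f"{attr}_mean"] = statistics.fmean(values)
        # Population deviation, so a one-event prefix yields 0 (assumption).
        features[f"{attr}_std"] = statistics.pstdev(values)
    return features


# Hypothetical prefix with two categorical and one numeric event attribute.
prefix = [
    {"activity": "A", "resource": "r1", "amount": 10.0},
    {"activity": "B", "resource": "r1", "amount": 30.0},
    {"activity": "A", "resource": "r2", "amount": 20.0},
]
levels = {"activity": ["A", "B"], "resource": ["r1", "r2"]}
encoded = aggregate_prefix(prefix, ["activity", "resource"], ["amount"], levels)
print(encoded)
```

The resulting vector has one feature per categorical level plus two features per numeric attribute, regardless of the length of the prefix.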

For index-based encoding (Section 4.3.4), we focus on the basic index-encoding technique without the HMM extension proposed in (Leontjeva et al., 2015). The reason is that the results reported in (Leontjeva et al., 2015) do not show that HMM provides any visible improvement, and instead this encoding adds complexity to the training phase.

5.3. Results: accuracy and earliness

Tables 8, 13, 14, and 15 report the overall AUC for each dataset and method using the XGBoost, RF, logit, and SVM classifier, respectively. The overall AUC values are obtained by first calculating the AUC scores separately for each prefix length (using only prefixes of this length) and, then, taking the weighted average of these scores, where the weights are assigned according to the number of prefixes used for the calculation of a given AUC score. This weighting assures that the overall AUC is influenced equally by each prefix in the evaluation set, instead of being biased towards longer prefixes (i.e., where many cases have already finished). The best-performing classifier is XGBoost, which achieves the highest accuracy in 16 out of 24 datasets. It is followed closely by RF, which achieves top performance in 12 datasets (outperforms XGBoost in 4 and achieves equally good performance in 8). Logit and SVM in general do not reach the same level of accuracy. Exceptions are bpic2015_4, sepsis_2, and traffic, where logit outperforms the other classifiers. In order to keep the discussion comprehensible, in the following we will go into details with analysing the results obtained by XGBoost.
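The overall AUC described above is a weighted average of the per-prefix-length AUC scores. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def overall_auc(auc_by_length, n_prefixes_by_length):
    """Weighted average of per-prefix-length AUC scores, weighted by the
    number of prefixes used to compute each score."""
    total = sum(n_prefixes_by_length[length] for length in auc_by_length)
    weighted = sum(auc_by_length[length] * n_prefixes_by_length[length]
                   for length in auc_by_length)
    return weighted / total


# Hypothetical scores: fewer prefixes remain as the prefix length grows.
auc_by_length = {1: 0.6, 2: 0.8, 3: 0.9}
n_prefixes = {1: 100, 2: 60, 3: 40}

print(overall_auc(auc_by_length, n_prefixes))
print(sum(auc_by_length.values()) / len(auc_by_length))  # unweighted mean
```

In this toy example the weighted value is lower than the unweighted mean, because the high score at prefix length 3 is based on fewer prefixes and therefore carries less weight.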

We can see in Table 8 that the single_agg method achieves the best results in 11 out of 24 datasets. It is followed by prefix_agg, cluster_agg, and state_agg, which provide the overall best predictions in 9, 6, and 5 datasets, respectively. The last state encodings in general perform worse than their aggregation encoding counterparts. A clear exception to that rule is bpic2011_4, where single_laststate and cluster_laststate considerably outperform the other methods. The index-based encoding reaches the highest accuracy in only two datasets, bpic2012_2 and traffic. The KNN-based approaches are never among the top methods.

Columns (datasets): bpic2011_1, bpic2011_2, bpic2011_3, bpic2011_4, insurance_1, insurance_2, bpic2015_1, bpic2015_2, bpic2015_3, bpic2015_4, bpic2015_5, production, sepsis_1, sepsis_2, sepsis_3, bpic2012_1, bpic2012_2, bpic2012_3, bpic2017_1, bpic2017_2, bpic2017_3, traffic, hospital_1, hospital_2.
Rows (methods): single_laststate, single_agg, knn_laststate, knn_agg, state_laststate, state_agg, cluster_laststate, cluster_agg, prefix_index, prefix_laststate, prefix_agg.

Table 8. Overall AUC (XGBoost)

Figures 5 and 6 present the prediction accuracy in terms of AUC, evaluated over different prefix lengths. (To maintain readability of the plots, some methods, namely the KNN approaches and the last state encodings except for single_laststate, are not shown in these figures because of their low performance according to Table 8. For a complete comparison of the methods, see Figures 17 and 18 in Appendix.) Each evaluation point includes prefix traces of exactly the given length. In other words, traces that are altogether shorter than the required prefix are left out of the calculation. Therefore, the number of cases used for evaluation is monotonically decreasing when increasing prefix length. In most of the datasets, we see that starting from a specific prefix length the methods with aggregation encoding achieve perfect prediction accuracy (AUC = 1). It is natural that the prediction task becomes trivial when cases are close to completion, especially if the labeling function is related to the control flow or to the data payload present in the event log. However, there are a few exceptions to this rule: in the bpic2012 datasets, sepsis_2, and sepsis_3, the results seem to decline on larger prefix sizes. To investigate this phenomenon, we recalculated the AUC scores on the longer traces only, i.e., traces that have a length larger than or equal to the maximum considered trace length (see Figure 19 in Appendix). This analysis confirmed that the phenomenon is caused by the fact that the datasets contain some short traces for which it appears to be easy to predict the outcome. These short traces are not included in the later evaluation points, as they have already finished by that time. Therefore, we are left with longer traces only, which appear to be more challenging for the classifier, dragging down the total AUC score on larger prefix lengths.

Figure 5. AUC across different prefix lengths using XGBoost (1)
Figure 6. AUC across different prefix lengths using XGBoost (2)

We can see from Figure 5 that in several cases (e.g., bpic2011 and bpic2015 datasets, bpic2012_1, sepsis_2), all the methods achieve a similar AUC on shorter prefixes, but then quickly grow apart as the size of the prefix increases. In particular, the aggregation encoding seems to be able to carry along relevant information from the earlier prefixes, while the last state encoding entails more limited information that is often insufficient for making accurate predictions. The index-based encoding, although lossless, does not outperform the lossy encoding schemes. In fact, in several cases (e.g., in bpic2012_1, bpic2012_3, hospital_2, sepsis_2, traffic) it performs even worse than the last state encoding. This suggests that in the given datasets, the order of events is not as relevant for determining the final outcome of the case. Instead, combining the knowledge from all events performed so far provides much more signal. Alternatively, it may be that the order of events (i.e., the control-flow) does matter in some cases, but the classifiers considered in this study (including XGBoost) are not able to infer high-level control-flow features by themselves, which would explain why we see that even the simple aggregation-based methods outperform index-based encoding. This phenomenon deserves a separate in-depth study.

The choice of the bucketing method seems to have a smaller effect on the results than the sequence encoding. Namely, the best results are usually achieved using the aggregation encoding with either single bucket, clustering, or prefix length based bucketing. The state-based bucketing results vary a lot across different datasets. While in some datasets (e.g., insurance, bpic2012, bpic2017) it easily achieves performance on the same level as other aggregation encodings, in other datasets (e.g., bpic2011 and bpic2015) the performance is rather spiky across different prefix lengths. A possible explanation for this is that in the bpic2011 and bpic2015 datasets the number of different event classes (and, therefore, the number of states) is much larger than in the insurance, bpic2012, and bpic2017 datasets. As a result, each classifier in state-based bucketing receives a small number of traces for training, which in turn causes the predictions to be less reliable (see the counts of training prefix traces in each bucket in Figures 15-16 in Appendix). The same phenomenon can be seen in case of prefix_agg, which usually achieves very good performance, but at times can produce unexpectedly inaccurate results (like in the longer prefixes of bpic2011_1 and bpic2012_2). Furthermore, the performance of single_agg and cluster_agg is very comparable, with both methods producing stable and reliable results on all datasets and across all prefix sizes. The optimal number of clusters in case of cluster_agg with XGBoost was often found to be between 2-6 (see Table 11 in Appendix), which explains why these two methods behave similarly. In some cases where the optimized number of clusters was higher, e.g., sepsis_1, bpic2012_1, and hospital_2, the accuracy of cluster_agg drops compared to single_agg.

These observations, further supported by the fact that KNN never appears among the top performing methods, lead to the conclusion that it is preferable to build few classifiers (or even just a single one), with a larger number of traces as input. XGBoost seems to be a classifier sophisticated enough to derive the “bucketing patterns” by itself when necessary. Another advantage of the single_agg method over its closest competitor, the cluster_agg, is the simplicity of a single bucket. In fact, no additional preprocessing step for bucketing the prefix traces is needed. On the other hand, clustering (regardless of the clustering algorithm) comes with a set of parameters, such as the number of clusters in k-means, that need to be tuned for optimal performance. Therefore, the time and effort needed from the user of the system for setting up the prediction framework can be considerably higher in case of cluster_agg, which makes single_agg the overall preferred choice of method in terms of accuracy and earliness. This discussion concludes the answer to RQ3.1.

5.4. Results: time performance

The time measurements for all of the methods and classifiers, calculated as averages over 5 identical runs using the final (optimal) parameters, are presented in Tables 9 and 10 (XGBoost), Tables 16 and 17 (RF), Tables 18 and 19 (logit), and Tables 20 and 21 (SVM). In the offline phase, the fastest of the four classifiers is logit. The ordering of the others differs between the small (production, bpic2011, bpic2015, insurance, and sepsis) and the large (bpic2017, traffic, hospital) datasets. In the former group, the second fastest classifier is SVM, usually followed by RF and, then, XGBoost. Conversely, in the larger datasets, XGBoost appears to scale better than the others, while SVM tends to be the slowest of the three. In terms of online time, logit, SVM, and XGBoost yield comparable performance, while RF is usually slower than the others. In the following, we will, again, analyse deeper the results obtained with the XGBoost classifier.

Recall that the KNN method (almost) skips the offline phase, since all the classifiers are built at runtime. The offline time for KNN still includes the time for constructing the prefix log and setting up the matrix of encoded historical prefix traces, which is later used for finding the nearest neighbors for running traces. Therefore, the offline times in case of the KNN approaches are almost negligible. The offline phase for the other methods (i.e., excluding KNN) takes between 3 seconds on the smallest dataset (production) to 6 hours and 30 minutes on hospital_1. There is no clear winner between the last state encoding and the corresponding aggregation encoding counterparts, which indicates that the time for applying the aggregation functions is small compared to the time taken for training the classifier. The most time in the offline phase is, in general, taken by index-based encoding that constructs the sequences of events for each trace.

In terms of bucketing, the fastest approach in the offline phase is usually state-based bucketing, followed by either the prefix length or the clustering based method, while the slowest is single bucket. This indicates that the time taken to train multiple (“small”) classifiers, each trained with only a subset of the original data, is smaller than training a few (“large”) classifiers using a larger portion of the data.

In general, all methods are able to process an event in less than 100 milliseconds during the online phase (the times in Tables 9 and 10 are in milliseconds per processed event in a partial trace). Exceptions are hospital_1 and hospital_2, where processing an event takes around 0.3-0.4 seconds. The online execution times are very comparable across all the methods, except for KNN and prefix_index. While prefix_index often takes double the time of other methods, the patterns for KNN are less straightforward. Namely, in some datasets (bpic2012, sepsis, production, insurance, and traffic), the KNN approaches take considerably more time than the other techniques, which can be explained by the fact that these approaches train a classifier at runtime. However, somewhat surprisingly, in other datasets (hospital and bpic2011 datasets) the KNN approaches yield the best execution times even at runtime. A possible explanation for this is that in cases where all the selected nearest neighbors are of the same class, no classifier is trained and the class of the neighbors is immediately returned as the prediction. However, note that the overall AUC in these cases is 7-21 percentage points lower than that of the best method (see Table 8). In the online phase, the overhead of applying aggregation functions becomes more evident, with the last state encoding almost always outperforming the aggregation encoding methods by a few milliseconds. The fastest method in the online phase tends to be prefix_laststate, which outperforms the others in 17 out of 24 datasets. It is followed by knn_laststate, state_laststate, and single_laststate.

Datasets: bpic2011_1, bpic2011_2, bpic2011_3, bpic2011_4, bpic2015_1, bpic2015_2, bpic2015_3, bpic2015_4, bpic2015_5, production, insurance_1, insurance_2.
Columns per dataset: offline_total (s), online_avg (ms).
Rows (methods): single_laststate, single_agg, knn_laststate, knn_agg, state_laststate, state_agg, cluster_laststate, cluster_agg, prefix_index, prefix_laststate, prefix_agg.

Table 9. Execution times for XGBoost (1)

Datasets: sepsis_1, sepsis_2, sepsis_3, bpic2012_1, bpic2012_2, bpic2012_3, bpic2017_1, bpic2017_2, bpic2017_3, traffic, hospital_1, hospital_2.
Columns per dataset: offline_total (s), online_avg (ms).
Rows (methods): single_laststate, single_agg, knn_laststate, knn_agg, state_laststate, state_agg, cluster_laststate, cluster_agg, prefix_index, prefix_laststate, prefix_agg.

Table 10. Execution times for XGBoost (2)

In terms of online execution times, the observed patterns are in line with those of other classifiers. However, there are some differences in the offline phase. Namely, in case of RF, the single classifiers perform relatively better as compared to bucketing methods. Furthermore, the difference between the encoding methods becomes more evident, with the last state encodings usually outperforming their aggregation encoding counterparts. The index-based encoding is still the slowest of the techniques. In case of logit, all the methods achieve comparable offline times, except for index-based encoding and the clustering based bucketings, which are slower than the others. In case of SVM, the single_laststate method tends to be much slower than other techniques. This discussion concludes the answer to RQ3.2.

5.5. Results: gap-based filtering

In order to investigate the effects of gap-based filtering on the execution times and the accuracy, we selected 4 methods based on their performance in the above subsections: single_agg, single_laststate, prefix_index, and prefix_agg. The first three of these methods were shown to take the most time in the offline phase, i.e., they have the most potential to benefit from a filtering technique. Also, single_agg and prefix_agg achieved the highest overall AUC scores, which makes them the most attractive candidates to apply in practice. Furthermore, we selected 6 datasets which are representative in terms of their sizes (i.e., number of traces), consist of relatively long traces on average, and did not yield a very high accuracy very early in the trace.

We can see in Figure 7 that the smaller of the tested gaps yields an improvement of about 2-3 times in the offline execution times, while with the larger gap the improvement is usually around 3-4 times, as compared to no filtering (a gap of 1). For instance, in case of single_agg on the bpic2017_2 dataset, this means that the offline phase takes about 30 minutes instead of 2 hours. At the same time, the overall AUC remains at the same level, sometimes even increasing when filtering is applied (Figure 8). On the other hand, the gap-based filtering only has a marginal (positive) effect on the online execution times, which usually remain on the same level as without filtering (Figure 9). This concludes the answer to RQ3.3.

Figure 7. Offline times across different gaps (XGBoost)
Figure 8. AUC across different gaps (XGBoost)
Figure 9. Online times across different gaps (XGBoost)

5.6. Results: categorical domain filtering

To answer RQ3.4, we proceed with the 4 methods as discussed in the previous subsection. To better investigate the effect of filtering the categorical attribute levels, we distinguish between the static and the dynamic categorical attributes. For investigating the effects of dynamic categorical domain filtering, we selected 9 datasets that contain a considerable number of levels in the dynamic categorical attributes.

Both the offline (Figure 10) and the online (Figure 12) execution times tend to increase linearly when the proportion of levels is increased. As expected, the prefix_index method benefits the most from the filtering, since the size of the feature vector increases more rapidly than in the other methods when more levels are added (the vector contains one feature per level per event). Although the overall AUC is negatively affected by the filtering of levels (see Figure 11), reasonable tradeoffs can still be found. For instance, when using 50% of the levels in case of single_agg on the hospital_2 dataset, the AUC is almost unaffected, while the training time has decreased by more than 30 minutes and the online execution times have decreased by a half.
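The roughly linear dependence on the proportion of levels, and the stronger effect on prefix_index, follow from how many features one dynamic categorical attribute contributes under each encoding. The sketch below is a back-of-the-envelope calculation with hypothetical numbers and function names; it only restates the counting argument from the text (one feature per level for the last state and aggregation encodings, one feature per level per event for the index-based encoding).

```python
import math


def kept_levels(n_levels, proportion):
    """Number of levels retained after frequency-based filtering."""
    return max(1, math.ceil(n_levels * proportion))


def n_features(n_levels, prefix_length, encoding):
    """Features contributed by one dynamic categorical attribute."""
    if encoding in ("laststate", "agg"):
        return n_levels                   # one feature per level
    if encoding == "index":
        return n_levels * prefix_length   # one feature per level per event
    raise ValueError(f"unknown encoding: {encoding}")


# Hypothetical attribute with 200 levels, prefixes of length 20.
for proportion in (1.0, 0.5):
    levels = kept_levels(200, proportion)
    print(proportion,
          n_features(levels, 20, "agg"),
          n_features(levels, 20, "index"))
```

Halving the levels removes 100 features under the aggregation encoding but 2000 under the index-based encoding in this example, which illustrates why prefix_index benefits the most.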

Figure 10. Offline times across different filtering proportions of dynamic categorical attribute levels (XGBoost)
Figure 11. AUC across different filtering proportions of dynamic categorical attribute levels (XGBoost)
Figure 12. Online times across different filtering proportions of dynamic categorical attribute levels (XGBoost)

We performed similar experiments by filtering the static categorical attribute domains, selecting 6 datasets that contain a considerable number of levels in these attributes. However, the improvements in execution times were marginal compared to those obtained when using dynamic attribute filtering (see Figures 20-22 in Appendix). This is natural, since the static attributes have a smaller effect on the size of the feature vector (each level occurs in the vector only once). This concludes the answer to RQ3.4.

6. Threats to validity

One of the threats to the validity of this study relates to the potential selection bias in the literature review. To minimize this, we described our systematic literature review procedure on a level of detail that is sufficient to replicate the search. However, in time the search and ranking algorithms of the used academic database (Google Scholar) might be updated and return different results. Another potential source of bias is the subjectivity when applying inclusion and exclusion criteria, as well as when determining the primary and subsumed studies. In order to alleviate this issue, all the included papers were collected in a publicly available spreadsheet, together with decisions and reasons about excluding them from the study. Moreover, each paper was independently assessed against the inclusion and exclusion criteria by two authors, and inconsistencies were resolved with the mediation of a third author.

Another threat to validity is related to the comprehensiveness of the conducted experiments. In particular, only one clustering method was tested, a single state abstraction was used when building the transition systems for state-based bucketing, and four classification algorithms were applied. It is possible that there exists, for example, a combination of an untested clustering technique and a classifier that outperforms the settings used in this study. Also, although the hyperparameters were optimized using a state-of-the-art hyperparameter optimization technique, it is possible that, with more optimization iterations or a different optimization algorithm, other parameter settings would be found that outperform the settings used in the current evaluation. Furthermore, the generalizability of the findings is to some extent limited by the fact that the experiments were performed on a limited number of prediction tasks (24), constructed from nine event logs. Although these are all real-life event logs from different application fields that exhibit different characteristics, the results might differ when using other datasets or different log preprocessing techniques for the same datasets. In order to mitigate these threats, we built an open-source software framework which allows the full replication of the experiments, and made this tool publicly available. Moreover, additional datasets, as well as new sequence classification and encoding methods, can be plugged in, so the framework can be used for future experiments. Also, the preprocessed datasets constructed from the three publicly available event logs are included together with the tool implementation in order to enhance the reproducibility of the experiments.

7. Conclusion

This study provided a survey, comparative analysis and evaluation of existing outcome-oriented predictive business process monitoring techniques. The relevant existing studies were identified through a systematic literature review (SLR), which revealed 14 studies (some described across multiple papers) dealing with the problem of predicting case outcomes. Out of these, seven were considered to contain a distinct contribution (primary studies). Through further analysis of the primary studies, a taxonomy was proposed based on two main aspects: the trace bucketing approach and the sequence encoding method employed. Combinations of these two aspects led to a total of 11 distinct methods.

The studies were characterized from different perspectives, resulting in a taxonomy of existing techniques. Finally, a comparative evaluation of the 11 identified techniques was performed using a unified experimental set-up and 24 predictive monitoring tasks constructed from 9 real-life event logs. To ensure a fair evaluation, all the selected techniques were implemented as a publicly available consolidated framework, which is designed to incorporate additional datasets and methods.

The results of the benchmark show that the most reliable and accurate results (in terms of AUC) are obtained using a lossy (aggregation) encoding of the sequence, e.g., the frequencies of performed activities rather than the ordered activities. One of the main benefits of this encoding is that it represents all prefix traces, regardless of their length, with the same number of features. This way, a single classifier can be trained over all of the prefix traces, allowing the classifier to derive meaningful patterns by itself. These results disprove the existing opinion in the literature about the superiority of a lossless encoding of the trace (index-based encoding), which requires prefixes to be divided into buckets according to their length, with a separate classifier trained on each such subset of prefixes.
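The difference between the two encodings can be sketched in a few lines. The example below is a minimal illustration restricted to the activity attribute; the function names and the toy activity alphabet are hypothetical and do not come from the benchmark framework.

```python
from collections import Counter


def aggregation_encode(prefix, activities):
    """Lossy encoding: frequency of each activity, ignoring order."""
    counts = Counter(prefix)
    return [counts[a] for a in activities]


def index_encode(prefix, activities):
    """Lossless encoding: one one-hot block per event position."""
    vector = []
    for event in prefix:
        vector.extend(1 if event == a else 0 for a in activities)
    return vector


# Hypothetical activity alphabet and trace prefixes.
activities = ["register", "check", "approve", "reject"]
short_prefix = ["register", "check"]
long_prefix = ["register", "check", "check", "approve"]

# Aggregation: both prefixes map to vectors of the same length.
agg_short = aggregation_encode(short_prefix, activities)
agg_long = aggregation_encode(long_prefix, activities)

# Index-based: the vector length grows with the prefix length.
idx_short = index_encode(short_prefix, activities)
idx_long = index_encode(long_prefix, activities)
```

Here the aggregated vectors both have four features, so one classifier can consume prefixes of any length, whereas the index-based vectors have 8 and 16 features and therefore need one classifier per prefix length. The price of aggregation is that ordering is lost: the prefix ["check", "register"] yields the same aggregated vector as short_prefix.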

The study paves the way to several directions of future work. In Section 2 we noted that case and event attributes can be of categorical, numeric or textual type. The systematic review showed that existing methods are focused on handling categorical and numeric attributes, to the exclusion of textual ones. Recent work (Teinemaa et al., 2016) has shown how text mining techniques can be used to extend the index-based encoding approach in order to handle text attributes. However, this work considered a reduced set of text mining techniques and has only been tested on two datasets of relatively small size and complexity.

Secondly, the methods identified in the survey are mainly focused on extracting features from one trace at a time (i.e., intra-case features), while only a single inter-case feature (the number of open cases) is included. However, because the ongoing cases of a process share the same pool of resources, the outcome of a case may also depend on other aspects of the current state of the remaining ongoing cases in the process. Therefore, the accuracy of the models tested in this benchmark could be further improved by using a larger variety of inter-case features.
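To make the notion of an inter-case feature concrete, the sketch below computes the number of open cases at a given point in time. It assumes that each case is summarized by a (start, end) pair of numeric timestamps and that a case counts as open from its first event up to, but not including, its last event; both the data layout and this definition of "open" are illustrative assumptions rather than the benchmark's implementation.

```python
def open_cases_at(timestamp, case_intervals):
    """Count cases that have started but not yet finished at `timestamp`."""
    return sum(
        1
        for start, end in case_intervals.values()
        if start <= timestamp < end
    )


# Hypothetical cases with (start, end) timestamps in arbitrary time units.
case_intervals = {
    "c1": (0, 10),
    "c2": (5, 20),
    "c3": (12, 15),
}

# The feature value that would be attached to an event occurring at time 7.
n_open = open_cases_at(7, case_intervals)
```

The resulting count can be appended to the feature vector of every event, which is how a single inter-case signal of this kind could be combined with the intra-case encodings.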

Lastly, as long short-term memory (LSTM) networks have recently gained attention in predicting the remaining time and next activity of a running case of a business process (Evermann et al., 2016; Tax et al., 2017), another natural direction for future work is to study how LSTMs can be used for outcome prediction. In particular, could LSTMs automatically derive relevant features from collections of trace prefixes, and thus obviate the need for sophisticated feature engineering (aggregation functions), which has so far been the focus of predictive process monitoring research?

Acknowledgements.
This research is partly funded by the Australian Research Council and the Estonian Research Council.

References

  • Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Proc. of NIPS. 2546–2554.
  • Bradley (1997) Andrew P Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition 30, 7 (1997), 1145–1159.
  • Breiman (2001) Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.
  • Castellanos et al. (2005) Malu Castellanos, Norman Salazar, Fabio Casati, Umesh Dayal, and Ming-Chien Shan. 2005. Predictive business operations management. In International Workshop on Databases in Networked Information Systems. Springer, 1–14.
  • Conforti et al. (2013) Raffaele Conforti, Massimiliano De Leoni, Marcello La Rosa, and Wil MP Van Der Aalst. 2013. Supporting risk-informed decisions during business process execution. In International Conference on Advanced Information Systems Engineering. Springer, 116–132.
  • Conforti et al. (2015) Raffaele Conforti, Massimiliano de Leoni, Marcello La Rosa, Wil MP van der Aalst, and Arthur HM ter Hofstede. 2015. A recommendation system for predicting risks across multiple business process instances. Decision Support Systems 69 (2015), 1–19.
  • De Leoni et al. (2014) Massimiliano De Leoni, Wil MP van der Aalst, and Marcus Dees. 2014. A general framework for correlating business process characteristics. In International Conference on Business Process Management. Springer, 250–266.
  • de Leoni et al. (2016) Massimiliano de Leoni, Wil MP van der Aalst, and Marcus Dees. 2016. A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs. Information Systems 56 (2016), 235–257.
  • Di Francescomarino et al. (2017) Chiara Di Francescomarino, Marlon Dumas, Fabrizio M Maggi, and Irene Teinemaa. 2017. Clustering-based predictive process monitoring. IEEE Transactions on Services Computing (2017).
  • Dumas et al. (2013) Marlon Dumas, Marcello La Rosa, Jan Mendling, and Hajo A. Reijers. 2013. Fundamentals of Business Process Management. Springer.
  • Evermann et al. (2016) Joerg Evermann, Jana-Rebecca Rehse, and Peter Fettke. 2016. A Deep Learning Approach for Predicting Process Behaviour at Runtime. In Proceedings of the Business Process Management Workshops. Springer, 327–338.
  • Fernández-Delgado et al. (2014) Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res 15, 1 (2014), 3133–3181.
  • Folino et al. (2014) Francesco Folino, Massimo Guarascio, and Luigi Pontieri. 2014. Mining predictive process models out of low-level multidimensional logs. In International conference on advanced information systems engineering. Springer, 533–547.
  • Friedman (2001) Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.
  • Ghalwash and Obradovic (2012) Mohamed F Ghalwash and Zoran Obradovic. 2012. Early classification of multivariate temporal observations by extraction of interpretable shapelets. BMC bioinformatics 13, 1 (2012), 195.
  • Ghalwash et al. (2013) Mohamed F Ghalwash, Vladan Radosavljevic, and Zoran Obradovic. 2013. Extraction of interpretable multivariate patterns for early diagnostics. In Data Mining (ICDM), 2013 IEEE 13th International Conference on. IEEE, 201–210.
  • Ghattas et al. (2014) Johny Ghattas, Pnina Soffer, and Mor Peleg. 2014. Improving business process decision making based on past experience. Decision Support Systems 59 (2014), 93–107.
  • He et al. (2015) Guoliang He, Yong Duan, Rong Peng, Xiaoyuan Jing, Tieyun Qian, and Lingling Wang. 2015. Early classification on multivariate time series. Neurocomputing 149 (2015), 777–787.
  • Kitchenham (2004) Barbara Kitchenham. 2004. Procedures for performing systematic reviews. Keele, UK, Keele University 33, 2004 (2004), 1–26.
  • Lakshmanan et al. (2010) Geetika T Lakshmanan, Songyun Duan, Paul T Keyser, Francisco Curbera, and Rania Khalaf. 2010. Predictive analytics for semi-structured case oriented business processes. In International Conference on Business Process Management. Springer, 640–651.
  • Leontjeva et al. (2015) Anna Leontjeva, Raffaele Conforti, Chiara Di Francescomarino, Marlon Dumas, and Fabrizio Maria Maggi. 2015. Complex symbolic sequence encodings for predictive monitoring of business processes. In International Conference on Business Process Management. Springer, 297–313.
  • Lin et al. (2015) Yu-Feng Lin, Hsuan-Hsu Chen, Vincent S Tseng, and Jian Pei. 2015. Reliable early classification on multivariate time series with numerical and categorical attributes. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 199–211.
  • Maggi et al. (2014) Fabrizio Maria Maggi, Chiara Di Francescomarino, Marlon Dumas, and Chiara Ghidini. 2014. Predictive monitoring of business processes. In International Conference on Advanced Information Systems Engineering. Springer, 457–472.
  • Metzger et al. (2012) Andreas Metzger, Rod Franklin, and Yagil Engel. 2012. Predictive Monitoring of Heterogeneous Service-Oriented Business Networks: The Transport and Logistics Case. In 2012 Annual SRII Global Conference. IEEE Computer Society, 313–322.
  • Metzger et al. (2015) Andreas Metzger, Philipp Leitner, Dragan Ivanovic, Eric Schmieders, Rod Franklin, Manuel Carro, Schahram Dustdar, and Klaus Pohl. 2015. Comparing and Combining Predictive Business Process Monitoring Techniques. IEEE Trans. Systems, Man, and Cybernetics: Systems 45, 2 (2015), 276–290.
  • Olson et al. (2017) Randal S Olson, William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H Moore. 2017. Data-driven advice for applying machine learning to bioinformatics problems. arXiv preprint arXiv:1708.05070 (2017).
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • Pnueli (1977) Amir Pnueli. 1977. The temporal logic of programs. In Foundations of Computer Science, 1977., 18th Annual Symposium on. IEEE, 46–57.
  • Rogge-Solti and Weske (2013) Andreas Rogge-Solti and Mathias Weske. 2013. Prediction of Remaining Service Execution Time Using Stochastic Petri Nets with Arbitrary Firing Delays. In International Conference on Service-Oriented Computing (ICSOC). Springer, 389–403.
  • Rozumnyi (2017) Andrii Rozumnyi. 2017. A Dashboard-based Predictive Process Monitoring Engine. Master’s thesis. University of Tartu.
  • Salfner et al. (2010) Felix Salfner, Maren Lenk, and Miroslaw Malek. 2010. A survey of online failure prediction methods. ACM Computing Surveys (CSUR) 42, 3 (2010), 10.
  • Schwegmann et al. (2013a) Bernd Schwegmann, Martin Matzner, and Christian Janiesch. 2013a. A Method and Tool for Predictive Event-Driven Process Analytics. In Wirtschaftsinformatik. Citeseer, 46.
  • Schwegmann et al. (2013b) Bernd Schwegmann, Martin Matzner, and Christian Janiesch. 2013b. preCEP: facilitating predictive event-driven process analytics. In International Conference on Design Science Research in Information Systems. Springer, 448–455.
  • Senderovich et al. (2017) Arik Senderovich, Chiara Di Francescomarino, Chiara Ghidini, Kerwin Jorbina, and Fabrizio Maria Maggi. 2017. Intra and inter-case features in predictive process monitoring: A tale of two dimensions. In International Conference on Business Process Management. Springer, 306–323.
  • Tax et al. (2017) Niek Tax, Ilya Verenich, Marcello La Rosa, and Marlon Dumas. 2017. Predictive Business Process Monitoring with LSTM Neural Networks. In International Conference on Advanced Information Systems Engineering (CAiSE). Springer, 477–492.
  • Teinemaa et al. (2016) Irene Teinemaa, Marlon Dumas, Fabrizio Maria Maggi, and Chiara Di Francescomarino. 2016. Predictive Business Process Monitoring with Structured and Unstructured Data. In International Conference on Business Process Management. Springer, 401–417.
  • van der Aalst (2016) Wil MP van der Aalst. 2016. Process mining: data science in action. Springer.
  • Van Der Aalst et al. (2010) W MP Van Der Aalst, Vladimir Rubin, H MW Verbeek, Boudewijn F van Dongen, Ekkart Kindler, and Christian W Günther. 2010. Process mining: a two-step approach to balance between underfitting and overfitting. Software and Systems Modeling 9, 1 (2010), 87–111.
  • Van Der Spoel et al. (2012) Sjoerd Van Der Spoel, Maurice Van Keulen, and Chintan Amrit. 2012. Process prediction in noisy data sets: a case study in a dutch hospital. In International Symposium on Data-Driven Process Discovery and Analysis. Springer, 60–83.
  • van Dongen et al. (2008) Boudewijn F van Dongen, Ronald A Crooy, and Wil MP van der Aalst. 2008. Cycle time prediction: When will this case finally be finished? In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems". Springer, 319–336.
  • Verenich et al. (2015) Ilya Verenich, Marlon Dumas, Marcello La Rosa, Fabrizio Maria Maggi, and Chiara Di Francescomarino. 2015. Complex symbolic sequence clustering and multiple classifiers for predictive process monitoring. In International Conference on Business Process Management. Springer, 218–229.
  • Xing and Pei (2010) Zhengzheng Xing and Jian Pei. 2010. Exploring Disease Association from the NHANES Data: Data Mining, Pattern Summarization, and Visual Analytics. IJDWM 6, 3 (2010), 11–27. https://doi.org/10.4018/jdwm.2010070102
  • Xing et al. (2008) Zhengzheng Xing, Jian Pei, Guozhu Dong, and Philip S Yu. 2008. Mining sequence classifiers for early prediction. In Proceedings of the 2008 SIAM international conference on data mining. SIAM, 644–655.

Appendix

This Appendix reports the following:

  • The distributions of case lengths in different outcome classes (Figures 13-14);

  • The optimal number of clusters (Table 11) and the optimal number of neighbors for KNN approaches (Table 12) found for each classifier;

  • The distributions of bucket sizes for the different bucketing methods (Figures 15-16);

  • The overall AUC values for RF (Table 13), logit (Table 14), and SVM (Table 15);

  • The AUC scores across prefix lengths using XGBoost classifier and all of the compared methods (Figures 17-18);

  • The AUC scores across prefix lengths, including long traces only, using the XGBoost classifier (Figure 19);

  • The execution times for RF (Tables 16-17), logit (Tables 18-19), and SVM (Tables 20-21);

  • The offline (Figure 20) and online (Figure 21) execution times and the overall AUC scores (Figure 22) when filtering the static categorical attribute domain, using the XGBoost classifier.

Figure 13. Case length histograms for positive and negative classes (1)
Figure 14. Case length histograms for positive and negative classes (2)

              RF                          XGBoost                     Logit                       SVM
dataset       cluster_last   cluster_agg  cluster_last   cluster_agg  cluster_last   cluster_agg  cluster_last   cluster_agg
bpic2011_1              10             8            10             6            24            23            15            43
bpic2011_2              28             4             3             6            20            13            27            24
bpic2011_3              30             4            28             4            33            13            32            44
bpic2011_4               2            21             2             2            16             2            24            36
insurance_2              8            12             2             2             4             3            30            25
insurance_1              6            18             3             2            10            47            45             3
bpic2015_1              39            10            37             4            21             2            13             7
bpic2015_2              32             6            31             5            42             7             9            13
bpic2015_3              44            12            36            10            41            11            11            13
bpic2015_4              45             3            47             5            47            40            19             8
bpic2015_5              43             4            49            19            32             4             8             4
production              44            21            18             2            38            44            10             7
sepsis_1                32            39            36            30            39            41            41            38
sepsis_2                 3             4             3             2             7             3            10            28
sepsis_3                10             9             3             3             7            23            21            15
bpic2012_1              22             7             3            35             3             3             8            49
bpic2012_2               9             9             3             4             7             9            15             3
bpic2012_3              10            26             3             2            13             8            22            15
bpic2017_1              39            30            22            43             4            34            39            19
bpic2017_2              11            10            20            15            31            27            40             4
bpic2017_3              29            30            32            34            19            47            21            35
traffic                 42            43            29            23            42            36             9            13
hospital_1              35             2            33            48            10             8            48            32
hospital_2              19            48            33            45            11             8            34            28

Table 11. Best number of clusters

max width= RF XGBoost Logit SVM dataset knn_last knn_agg knn_last knn_agg knn_last knn_agg knn_last knn_agg bpic2011_1 47 45 50 50 46 48 49 39 bpic2011_2 45 47 50 46 26 21 42 40 bpic2011_3 50 46 50 46 45 32 44 42 bpic2011_4 40 41 43 46 44 50 16 32 insurance_2 46 47 50 45 48 49 32 44 insurance_1 45 49 44 50 29 36 16 12 bpic2015_1 31 49 50 45 32 17 12 3 bpic2015_2 48 50 46 46 41 12 11 2 bpic2015_3 29 48 46 46 40 49 2 3 bpic2015_4 30 43 50 36 13 9 3 38 bpic2015_5 30 37 46 50 27 47 2