DREAM-NAP: Decay Replay Mining to Predict Next Process Activities

by   Julian Theis, et al.

In complex processes, various events can happen in different sequences. The prediction of the next event activity given an a-priori process state is of importance in such processes. Recent methods leverage deep learning techniques such as recurrent neural networks to predict event activities from raw process logs. However, deep learning techniques cannot efficiently model logical behaviors of complex processes. In this paper, we take advantage of Petri nets as a powerful tool in modeling logical behaviors of complex processes. We propose an approach which first discovers Petri nets from event logs utilizing a recent process mining algorithm. In a second step, we enhance the obtained model with time decay functions to create timed process state samples. Finally, we use these samples in combination with token movement counters and Petri net markings to train a deep learning model that predicts the next event activity. We demonstrate significant performance improvements and outperform the state-of-the-art methods on eight out of nine real-world benchmark event logs in accuracy.




I Introduction

With the ongoing digitization and automation of industries, along with the steady increase in interconnected devices, we are able to project more interactions onto processes [1, 2]. These processes can represent procedures in different industries such as retail [3], software development [4], healthcare [5], network management [6], project management [7], or manufacturing [8]. One illustrative example is the process of a customer loan application in financial institutions [9]. An applicant can request money for specific purposes. The application then undergoes several process activities such as negotiation, request validation, fraud assessment, offer creation, and/or application rejection. Each activity of the process utilizes different institutional resources such as employees, customer records, IT systems, or third-party resources to check the creditworthiness of applicants. Though seemingly simple, the process becomes complex with an increasing number of applications and requirements of the institution.

While traditional process mining is primarily concerned with the discovery, analysis and monitoring of processes, predictive process management gains momentum by enhancing process models. Predictive process management plays an important role in the areas mentioned earlier. Knowing when specific situations occur, or in which state a process will be next, is important in order to meet qualitative and/or quantitative requirements of businesses and organizations.

Many businesses deploy Process-Aware Information Systems (PAIS) such as workflow management systems, case-handling systems, enterprise information systems, enterprise resource planning, and customer relationship management systems. These are software tools which manage and execute operational processes involving people, applications, and/or information sources based on process models [10]. Such systems record events stored in event logs which can be utilized for predictive process management. Typical use cases comprise the prediction of the next event, forecasting of a process' final state, or time interval prediction of future events [11]. Predicting the next event attracts special attention since it gives organizations the ability to forecast process deviations. This type of early detection is essential for intervening before a process enters risky states [12]. Moreover, predictive process management assists businesses in resource planning and allocation, and provides insights into the condition of a process, for instance to fulfil service-level agreements [13, 14, 15].

With this motivation, a range of different methods have been proposed for predicting the events in process sequences. The most recent advances utilize different deep learning architectures such as Long Short-Term Memory (LSTM) neural networks and stacked autoencoders [16, 15, 17, 18]. However, these techniques do not first discover process models, but perform their predictions on the raw event logs. This makes their decision making hard to understand and difficult to explain, although such understanding is crucial to discover the weaknesses of a process. Furthermore, since neural networks are not infallible [19], commonsense knowledge and obvious logical policies must be introduced into a deep learning model from the beginning to reduce potential vulnerability. This knowledge is easy to obtain from process discovery algorithms. Therefore, modeling processes from scratch using neural networks is costly and partially redundant. Thus, one of the research questions is how to retain process models like Petri nets (PN) [20] with their logic and combine them with the strengths of deep learning to obtain more understandable models while improving performance at the same time.

Further research motivation arises from limitations of recent methods. First, LSTMs lack the ability to memorize distant events in running process cases [21, 22, 23]. This is an issue in long-running process cases in which a distant event might be the reason for a later one. Second, most recent techniques in this area do not consider time as a continuous feature, thus losing information by discretization. However, the actual duration between two events might be correlated with the type of a later event, similar to deadline-based time PNs [24, 25].

In the current work, we propose an innovative method to predict the activity of the next event of a running process case which engages with the issues mentioned above. We first leverage a state-of-the-art process mining algorithm to discover a PN based process model from an event log. Then, we enhance the process model with time decay functions. In this way, we can create timed-state samples which we finally couple with process resources to train a neural network for the prediction of the activity of the next event. We call this approach Decay Replay Mining - Next Activity Prediction (DREAM-NAP). By taking this approach, we demonstrate significant improvements. Our method outperforms the state-of-the-art techniques on most of the popular benchmark datasets in accuracy. Specifically, we achieve first places on eight out of nine datasets and a second place on the remaining event log.

This paper is structured as follows. Section II discusses related work and most recent advances in the next event prediction of business processes. We introduce preliminaries in Section III. Section IV focuses on the proposed approach, especially on the decay function modeling in PNs and the deep learning architecture. Section V evaluates the approach against different existing methods. Finally, we conclude the paper and discuss the future work in Section VI.

II Related Work

The application of deep learning to predictive business process mining has grown enormously during recent years. Researchers have shown the applicability of machine and deep learning to several target variables such as the remaining time of running cases [26], forecasting the time of events [27], and predicting upcoming events in running processes while utilizing a-priori knowledge [28]. The prediction of event activities can be considered as a classification problem in which the probability of a next event activity $a$ given the state of the process $s_t$ at time $t$, $P(a \mid s_t)$, is to be found.

Early predictive models focused on analytical approaches. Le et al. [29] introduced a hybrid approach consisting of a sequence alignment technique to extract similar patterns and to predict upcoming event activities based on a combination of Markov models. The next event activity of a running process case is therefore determined by the transition probabilities of the Markov models.

Becker et al. [30] addressed this problem with a similar approach in which historical event data is used to create a Probabilistic Finite Automaton. In comparison, Ceci et al. [31] proposed an approach which can handle incomplete traces, is robust to noise, and deals with overfitting. This approach leverages sequence mining. Efficient frequent pattern mining is applied to create a tree where prediction models are associated with each node (also called nested model learning). These prediction models can be any traditional machine learning algorithms for classification.

Lakshmanan et al. [32] developed a method which models a process in a probabilistic and instance-specific way. This model is able to predict next event activities and can be translated into a Markov chain. Their approach has been implemented on a simulated automobile insurance claim process.

Similarly, Unuvar et al. [33] proposed a method to predict the likelihood of future process tasks by modeling parallel paths which can be either dependent or independent. The authors applied their methodology to a simulated marketing campaign business process model.

More recently, Breuker et al. [12] introduced a predictive model based on the theory of grammatical inference. They have modeled business processes probabilistically with a method called RegPFA which is based on Probabilistic Finite Automata. Grammatical inference is applied on top of the finite automaton. One of the advantages is that the methodology is based on weaker biases while maintaining comprehensibility. This is important, because users without deep technical knowledge can interpret and understand the models. Breuker et al. evaluated their approach against two publicly available real-world logs, demonstrating significant performance improvements. Their method is able to predict the next event activity given a running process case with accuracies between and .

Most recent research studies have shown the applicability of deep learning to predict process states and events. Evermann et al. [16] showed in 2017 that neural networks, especially recurrent neural networks, can be applied to predict next event activities in processes and improve state-of-the-art prediction accuracies. They create word embeddings from each event of the event log to train an LSTM neural network. Therefore, the process is modeled implicitly by the neural network itself. Evermann et al. used the same datasets as Breuker et al. [12] for comparison.

A comparable approach has been elaborated by Tax et al. [15] who predict the next events including their timestamps and remaining case times using LSTMs. This approach is similar to the one Evermann et al. demonstrated before. A major drawback of LSTMs in this context is their limited memory due to the predefined size of the memory state representation which is used to predict next events. Distant events in long running cases vanish over time from the memory state vector [21, 22, 23].

An adaptation has been made by Khan et al. [17] to overcome the memory limitations of LSTMs by applying memory-augmented neural networks. This technique leverages external memory modules for long-term retention to model complex event processes. The authors demonstrate the applicability and report slight performance improvements compared to Tax et al. [15].

A further approach has been elaborated by Mehdiyev et al. [18]. The authors encode events into n-gram features using a sliding window approach and leverage feature hashing on top. These features, in turn, are used to train a deep learning model consisting of unsupervised stacked autoencoders and supervised fine tuning. This architecture has shown significant performance improvements across most of the datasets, yet it is more complex compared to the methods described earlier. Their method is able to predict the next event activity given a running process case with accuracies between and .

Since deep learning techniques are difficult to interpret, Lee et al. [34] developed a method based on matrix factorization and knowledge from business process management to create predictive models which are easier to understand. The authors claim to require fewer parameters than neural networks while maintaining good performance.

In this work, we have two major contributions. First, we incorporate time as a continuous feature since the duration between two events might be correlated with the type of subsequent occurring events in real-world processes. Second, we overcome the short-term memory limitations when applying LSTMs using a decay function mechanism. With these two advancements, we show that our next event prediction algorithm performs significantly better than the previously introduced methods.

III Preliminaries

In this section, we introduce the preliminaries which are required throughout the paper. We introduce event logs, followed by PNs. We provide a general introduction to Process Mining in III-D and introduce a state-of-the-art process discovery algorithm in Section III-E. Finally, we define neural networks in Section III-F.

III-B Event Logs

The definitions in this subsection are based on the work of van der Aalst et al. [35] and Guo et al. [36].

An event can be any observable action and is represented using a unique identifier $e$. Each event is associated with an activity $a \in \mathcal{A}$ where $\mathcal{A}$ is the set of all possible activities. Moreover, an event encompasses attributes such as a mandatory timestamp as well as further non-mandatory attributes like associated costs, people, and resources. As such, two events might be associated with the same activity and can carry the same non-mandatory attribute values. However, these events will never be considered identical since their timestamps are different. We define $\mathcal{E}$ as the set of all possible event identifiers and $\mathcal{AN}$ as the set of all possible attribute names. Then for any event $e \in \mathcal{E}$ and any attribute $an \in \mathcal{AN}$, $\#_{an}(e)$ is the value of the attribute $an$ for event $e$. If an event $e$ does not encompass an attribute $an$, then $\#_{an}(e) = \perp$ (null). We denote the attribute name timestamp by $time$. Furthermore, we define a function $\pi: \mathcal{E} \rightarrow \mathcal{A}$ mapping each event to an activity.

A case is a finite sequence of events and is uniquely represented by a case identifier $c$. Each case can encompass its own attributes such as used resources, costs, or further application-specific information. In each case, an event can only occur once, based on our event definitions. In literature, the term trace is also used to describe a case. We define $\mathcal{C}$ as the set of all possible trace identifiers and $|c|$ as the length of trace $c$. The order of events in a trace has to satisfy the following constraint, where $e_i$ denotes the $i$th event of trace $c$.

$$\#_{time}(e_i) \le \#_{time}(e_j) \quad \forall\, 1 \le i < j \le |c| \qquad (1)$$

An event log $L$ is a set of traces such that each trace, and therefore each event, occurs only once. Moreover, $e_{ij}$ refers to the $j$th event in the $i$th trace of an event log $L$.
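The event log definitions above can be illustrated with a minimal Python sketch. The class name `Event`, its field names, and the helper `is_valid_trace` are hypothetical and chosen only for illustration; they are not taken from the authors' implementation. The helper checks the two trace properties stated in this subsection: timestamps are non-decreasing and no event occurs twice.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Event:
    """One event: unique identifier, activity, mandatory timestamp, optional attributes."""
    event_id: str
    activity: str
    timestamp: float
    attributes: Dict[str, Any] = field(default_factory=dict)


def is_valid_trace(trace: List[Event]) -> bool:
    """Check that timestamps never decrease and that no event identifier repeats."""
    ordered = all(a.timestamp <= b.timestamp for a, b in zip(trace, trace[1:]))
    ids = [e.event_id for e in trace]
    return ordered and len(ids) == len(set(ids))


# A toy loan-application trace (made-up values).
trace = [
    Event("e1", "submit", 0.0, {"resource": "clerk"}),
    Event("e2", "validate", 3.5),
    Event("e3", "offer", 9.0),
]
valid = is_valid_trace(trace)
```

An event log would then simply be a collection of such traces, for example a dictionary keyed by case identifier.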

III-C Petri Net

A PN is a tool that can represent a process model. It consists of a set of places, graphically represented as circles, and a set of transitions, represented as rectangles. Transitions correspond to event activities. Transitions and places are also referred to as nodes. Additionally, arcs are used to unidirectionally connect places to transitions and vice versa. We define a PN as

$$PN = (P, T, F) \qquad (2)$$

where $P$ is the set of places, $T$ is the set of transitions, and $F \subseteq (P \times T) \cup (T \times P)$ is the set of directed arcs connecting places and transitions [35, 36, 37]. The set $P \cup T$ is called the set of nodes. A node $x$ is the input node to another node $y$ iff $(x, y) \in F$. Consequently, $x$ is the output node to another node $y$ iff $(y, x) \in F$. For any $x \in P \cup T$, $\bullet x$ is the set of input nodes to $x$ and $x \bullet$ is the set of output nodes of $x$. The function $\lambda$ maps transitions to either event activities or to a non-observable activity $\tau$. It is defined such that [37]

$$\lambda: T \rightarrow \mathcal{A} \cup \{\tau\} \qquad (3)$$
Each place can hold a non-negative integer number of tokens. We define $\sigma(p)$ as the number of tokens in a place $p$ where $p \in P$.

The state of a PN corresponds to a marking $M \in \mathcal{M}$ where $\mathcal{M}$ is the set of all possible markings. We define $M$ as a vector of size $|P|$ with elements in $\mathbb{N}_0$, where $\mathbb{N}_0$ denotes the set of all non-negative integers and $|P|$ corresponds to the cardinality of $P$. Each element $M_i = \sigma(p_i)$ where $p_i$ is the $i$th place of $P$. The initial state $M_0$ is also called initial marking, whereas the final state $M_f$ is called final marking [35]. For our purpose, at least one of the elements in both $M_0$ and $M_f$ must be greater than $0$.

Moreover, a transition $t \in T$ is enabled, i.e. can be fired, only if

$$\sigma(p) \ge 1 \quad \forall\, p \in \bullet t \qquad (4)$$

Hidden transitions, a special type of transition, are associated with the non-observable activity $\tau$. Such transitions can always fire independently of observed activities as long as the introduced token requirements at incoming places are met. When firing a transition $t$, a token is removed from each of the input places $\bullet t$, while a token is added to each of the output places $t \bullet$.
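The enabling and firing rules can be sketched in a few lines of Python. This is a minimal illustration that assumes arc weights of one, as in the description above; the class name `PetriNet` and its methods are hypothetical and not part of any existing library.

```python
from typing import Dict, Set


class PetriNet:
    """Minimal Petri net: transitions with input/output places and a token marking."""

    def __init__(self, inputs: Dict[str, Set[str]], outputs: Dict[str, Set[str]],
                 marking: Dict[str, int]) -> None:
        self.inputs = inputs      # transition -> set of input places
        self.outputs = outputs    # transition -> set of output places
        self.marking = dict(marking)  # place -> number of tokens

    def enabled(self, transition: str) -> bool:
        """A transition is enabled if every input place holds at least one token."""
        return all(self.marking.get(p, 0) >= 1 for p in self.inputs[transition])

    def fire(self, transition: str) -> None:
        """Remove one token from each input place, add one to each output place."""
        if not self.enabled(transition):
            raise ValueError(f"transition {transition!r} is not enabled")
        for p in self.inputs[transition]:
            self.marking[p] -= 1
        for p in self.outputs[transition]:
            self.marking[p] = self.marking.get(p, 0) + 1


# Sequential toy net: p0 -> a -> p1 -> b -> p2, initial marking with one token in p0.
net = PetriNet(
    inputs={"a": {"p0"}, "b": {"p1"}},
    outputs={"a": {"p1"}, "b": {"p2"}},
    marking={"p0": 1, "p1": 0, "p2": 0},
)
a_enabled_before = net.enabled("a")
b_enabled_before = net.enabled("b")
net.fire("a")
```

After firing `a`, the token has moved from `p0` to `p1`, which in turn enables `b`.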

A PN is considered sound iff for each trace $c$ [38]:

  • it is always possible to reach the final marking $M_f$,

  • there are no remaining tokens other than those of $M_f$ when the final marking is reached,

  • and it is possible to execute an arbitrary event starting from $M_0$ by following the appropriate route through the PN.

Furthermore, we define a function $\gamma(p, c)$ for all $p \in P$ measuring the average time between the moment a token leaves a place $p$ and the moment a new token enters $p$, based on an input trace $c$. Finally, $\theta_p$ describes the most recent time that a token entered a place $p$.

III-D Process Mining

Process mining comprises the discovery, conformance checking, and enhancement of business processes [35, 39]. Process discovery is the algorithmic extraction of process models from event logs. One can carry out analysis on the obtained models, which are usually in the format of PNs, Business Process Model and Notation (BPMN), Event-driven Process Chains (EPCs), or Causal Nets (CN). In this paper, we will focus on PNs only.

Conformance is defined as the evaluation of the quality of a discovered process model, i.e. if it is a good representation of the process recorded by an event log. It is commonly evaluated based on fitness and precision among other metrics [35]. Therefore, each trace of an event log is replayed by executing the events sequentially on top of the process model. Fitness metric functions evaluate the quality of a process model by quantifying deviations between an event log and the replay response of a process model to this event log. A process model should allow to replay the behavior seen in the event log [35]. Precision metric functions represent the alignment between simulated traces from the obtained process model and true traces from the event log. Ideally, each generated trace by the process model should be realistic, thus being present in the actual event log.

Enhancement considers discovered process models as well as event logs to improve or extend the models. Examples of process enhancement include structural corrections to allow the occurrence of specific behavior or extending a process model with performance data.

III-E Split Miner

Split miner [40] is a process discovery algorithm which is characterized by recent significant performance improvements in comparison to existing state-of-the-art methods [41]. It is currently the best algorithm to automatically obtain PN process models from event logs with high fitness and precision. This discovery method has been developed to engage with the tradeoff between fitness, precision, and the complexity of the obtained process model.

Split miner consists of the following five steps [40]. First, it discovers a directly-follows dependency graph and detects short loops. In the second step, the algorithm searches for concurrency and marks the respective elements as such. Afterwards, split miner applies filtering such that each node is on a path from a single start node to an end node to guarantee soundness, the number of edges is minimal to reduce complexity, and every path from start to end has the highest possible sum of frequencies to maximize fitness. Fourth, the algorithm adds split gateways in order to capture choice and concurrency. As the final step, this discovery method detects joins.
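To make the first of these steps more concrete, the following Python sketch counts directly-follows relations between activities across a set of traces. It illustrates only the basic notion of a directly-follows graph and is not the split miner algorithm itself; loop detection, concurrency discovery, filtering, and gateway insertion are omitted. The function name `directly_follows` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def directly_follows(traces: Iterable[Sequence[str]]) -> Counter:
    """Count how often activity b directly follows activity a over all traces."""
    counts: Counter = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


# Toy log of three traces given as activity sequences (made-up values).
log = [["a", "b", "c"], ["a", "c"], ["a", "b", "c"]]
dfg = directly_follows(log)
```

Each key of the resulting counter is an edge of the directly-follows graph and each value its frequency, which is the kind of information a frequency threshold could later filter on.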

Split miner encompasses two hyperparameters: a frequency threshold $\eta$ to control the filtering process and $\epsilon$, which is a threshold to control parallelism detection. Both hyperparameters are percentiles, i.e. the numerical range is between $0$ and $1$. It has to be mentioned that this algorithm does not consider any attributes other than event activities during process discovery. The discovery algorithm is publicly available as a Java application [42].

III-F Neural Network

A neural network is a computing methodology motivated by biological nervous systems. Such networks consist of a set of artificial neurons $N$ which receive one or multiple inputs and produce one output. This set is divided into a predefined number $k$ of disjoint subsets $N_i$ where $1 \le i \le k$. Each subset represents a layer in the form of a matrix containing the outputs of the corresponding neurons. We refer to layer $N_1$ as the input and $N_k$ as the output layer of the neural network. Multiple so-called hidden layers can exist in between. In a fully connected neural network, all neurons of a layer $N_i$ are connected to all neurons of its adjacent layer $N_{i+1}$ for $1 \le i < k$. A very basic neural network can be defined in the following way [43, 44].

Fig. 1: This figure illustrates the flow diagram of the DREAM-NAP approach. It also visualizes the training and testing procedures. The elements of the approach are shown in green, train datasets in blue, test datasets in red, and evaluation datasets in yellow colors. The flows are color-coded correspondingly.

A neuron $j$ which belongs to layer $N_i$ calculates its output based on the weighted outputs of each predecessor neuron of layer $N_{i-1}$. Each direct connection between two neurons $h$ and $j$ is associated with a weight $w_{hj}$. Each neuron comprises a differentiable activation function $\varphi$ which is used to calculate the output of a neuron. Thus, the output $o_j$ of a neuron $j$ belonging to $N_i$ based on its predecessor layer $N_{i-1}$ can be calculated as

$$o_j = \varphi(net_j) \qquad (5)$$

where $net_j$ denotes the weighted input of neuron $j$. It follows that

$$o_j = \varphi\Big(\sum_{h \in N_{i-1}} w_{hj}\, o_h + b_j\Big) \qquad (6)$$

where $b_j$ is a bias term. Such a neural network is commonly modeled as an optimization problem where a cost function is defined as a function of the difference between neural network outputs and true values and is minimized by adapting the weights $w_{hj}$ of the neural network. This is called a supervised learning problem.
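The output computation of a fully connected layer can be sketched in plain Python as follows. This is a minimal illustration of a weighted sum plus bias passed through an activation function; the function names `relu` and `dense` and the weight layout (one row of incoming weights per neuron) are assumptions made for this example only.

```python
from typing import Callable, List, Sequence


def relu(x: float) -> float:
    """Rectified Linear Unit activation."""
    return max(0.0, x)


def dense(inputs: Sequence[float], weights: Sequence[Sequence[float]],
          biases: Sequence[float], activation: Callable[[float], float]) -> List[float]:
    """Compute the outputs of one fully connected layer.

    weights[j][h] is the weight of the connection from predecessor neuron h
    to neuron j of this layer; biases[j] is the bias term of neuron j.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        weighted_sum = sum(w * o for w, o in zip(row, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


# Two predecessor outputs feeding a layer with two neurons (made-up values).
layer_out = dense([1.0, 2.0], [[0.5, -1.0], [1.0, 1.0]], [0.0, 0.5], relu)
```

Stacking several such calls, with the output of one layer used as the input of the next, yields the forward pass of a fully connected network.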


IV Approach

The DREAM-NAP approach consists of three steps. First, we discover a PN model from an event log and associate each place of the PN with a decay function. Then, we replay the event log used for discovery and extract feature arrays incorporating decay function response values, token movement counts, and utilized resources. Finally, we train a neural network to predict the activity of the next event based on these feature arrays. A flow diagram of the training and testing procedure of our approach is visualized in Figure 1. In this section, we introduce each component in detail.

The source code of the proposed approach is available as a standalone Java application in our GitHub repository (the repository URL will be made available upon acceptance).

IV-A Decay Function Enhancement

To discover a PN, the corresponding event log has to consist of at least one non-empty trace and each event must be mappable to an observable activity. We draw on an existing PN discovery algorithm called split miner which has been introduced in Section III-E.

Decay functions are used to model data values that decrease over time. Such functions are commonly applied to population trend modeling, financial domains, and physical systems. The basic form of a decay function is

$$f(t) = \beta - \alpha t \qquad (7)$$

where $t$ is time, $\alpha$ is the rate of decay, and $\beta$ is a constant corresponding to the initial value. The decay function can be easily modified to model more complex behavior such as exponential or squared declines. However, the linear decay function presented in Equation 7 is the simplest option.

We associate each place $p \in P$ of the PN generated by split miner on an event log $L$ with a linear decay function $f_p(t)$. We denote the time difference between the current time, $t$, and the most recent time a token has entered place $p$, $\theta_p$, by $\delta_p = t - \theta_p$.

$$f_p(t) = \max(\beta - \alpha\, \delta_p,\ 0) \qquad (8)$$

We initially set $\theta_p = -\infty$ for all $p \in P$ such that $f_p(t) = 0$. In this way, we reset all decay functions of the PN. A decay function $f_p(t)$ will activate as soon as a token enters the corresponding place $p$. The value of this function declines over time and reactivates with a response value of $\beta$ immediately when a token enters this place.
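The behavior of such a place-level decay function can be sketched as follows. This is a minimal illustration under two assumptions that are not confirmed by the text: the response is clipped at zero, and a place that has not yet received a token is represented by `None` instead of an infinitely distant timestamp. The function and parameter names are hypothetical.

```python
from typing import Optional


def decay_response(t: float, last_entry: Optional[float],
                   alpha: float, beta: float) -> float:
    """Linear decay response of one place at time t.

    last_entry is the most recent time a token entered the place, or None
    if the place has not been activated yet (assumed to give a response of 0).
    """
    if last_entry is None:
        return 0.0
    delta = t - last_entry
    # Assumption: the linear decline is clipped so it never becomes negative.
    return max(beta - alpha * delta, 0.0)


not_activated = decay_response(5.0, None, alpha=0.25, beta=1.0)
just_activated = decay_response(5.0, 5.0, alpha=0.25, beta=1.0)
two_units_later = decay_response(7.0, 5.0, alpha=0.25, beta=1.0)
long_after = decay_response(100.0, 5.0, alpha=0.25, beta=1.0)
```

The four calls show a place that was never activated, a place at the moment of activation, a partially decayed response, and a fully decayed one.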

During replay, each event of an event log corresponds to a transition which fires immediately when the respective event is observed and token requirements are met. Instead of focusing on the fired transitions themselves, we can also unambiguously identify the sequence of fired transitions by observing the movement of tokens between places. By enhancing each place with a decay function described in Equation 8, we assign a level of importance to recent token movements compared to past ones. This mechanism scales event time information into a range from $0$ to $\beta$ without discretization and loss of generality.

We control the level of importance using the two decay function parameters $\alpha$ and $\beta$. Ideally, $\alpha$ should be set such that the slope of $f_p(t)$ covers the whole range from $\beta$ to $0$ based on the reactivation durations of a place $p$. In other words, the slope should not be too steep such that $f_p(t) = 0$ for a small $\delta_p$, nor too flat such that $f_p(t) \approx \beta$ for a large $\delta_p$. This cannot be achieved using a single $\alpha$ value for all decay functions of the PN when applying this mechanism to real-world processes with varying durations of reactivation. For this reason, we estimate an individual decay rate $\alpha_p$ for each place $p \in P$. We define the set of all decay rates as $\boldsymbol{\alpha} = \{\alpha_p \mid p \in P\}$ where the cardinality of $\boldsymbol{\alpha}$, $|\boldsymbol{\alpha}|$, equals $|P|$.

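We estimate $\boldsymbol{\alpha}$ by utilizing the event log $L$ consisting of $n$ traces, and the respective PN discovered by split miner on $L$. Each trace $c_i$ consists of $m_i$ events. We refer to the $j$th event of the $i$th trace of an event log by $e_{ij}$, as mentioned in Section III-B. The maximum trace duration observed in $L$ is denoted by $\Delta_{max}$ and satisfies the following condition.

$$\Delta_{max} \ge \#_{time}(e_{i m_i}) - \#_{time}(e_{i1}) \quad \forall\, 1 \le i \le n \qquad (9)$$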
For the estimation of $\alpha_p$, it is essential to know if a value for $\gamma(p, c)$ exists, i.e. if a place gets activated only once or if reactivations occur. Therefore, we define a function $\kappa(p, c)$ which returns the number of tokens that enter a place $p$ when replaying a trace $c$. We estimate $\alpha_p$ for two different cases based on the outcome of the following condition.

$$\kappa(p, c) \le 1 \quad \forall\, c \in L \qquad (10)$$

If Condition 10 holds, $\alpha_p$ will be set to a value such that the response of $f_p(t)$ will never equal $0$ before the last event of a corresponding trace has occurred. By doing so, we guarantee to carry information on the occurrence of a specific event in the response of the decay function until the end of a trace. Equation 11 defines $\alpha_p$ mathematically for this case.

$$\alpha_p = \frac{\beta}{\Delta_{max}} \qquad (11)$$

If Condition 10 does not hold, we consider the average reactivation duration of a place based on all traces of the respective event log. With this information, we set the decay rate to a value such that $f_p(t)$ provides a level of recent token movement importance for the average duration between reactivations. Consequently, the slope will be neither too steep nor too flat. Mathematically, we can estimate $\alpha_p$ by

$$\alpha_p = \frac{\beta}{mean(\{\gamma(p, c) \mid c \in L\})} \qquad (12)$$

where $mean$ is the arithmetic mean function.
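One possible reading of this two-case estimation is sketched below. It is an illustration only and rests on several assumptions: the decay function is linear with initial value `beta`; the decay rate is chosen so that the response reaches zero after the maximum trace duration when a place is never reactivated, and after the mean reactivation duration otherwise; and reactivation durations are approximated by the gaps between consecutive token entry times. The function name and its inputs are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def estimate_decay_rate(entry_times_per_trace: Sequence[Sequence[float]],
                        max_trace_duration: float, beta: float = 1.0) -> float:
    """Estimate the decay rate of one place from token entry times.

    entry_times_per_trace holds, for each replayed trace, the sorted times
    at which a token entered the place.
    """
    if max_trace_duration <= 0:
        raise ValueError("max_trace_duration must be positive")
    per_trace_means: List[float] = []
    for times in entry_times_per_trace:
        gaps = [b - a for a, b in zip(times, times[1:])]
        if gaps:  # the place was reactivated at least once in this trace
            per_trace_means.append(mean(gaps))
    if not per_trace_means:
        # No reactivation in any trace: decay over the longest trace duration.
        return beta / max_trace_duration
    # Reactivations exist: decay over the average reactivation duration
    # (assumes that average is greater than zero).
    return beta / mean(per_trace_means)


rate_with_reactivation = estimate_decay_rate([[0.0, 10.0], [5.0]], 20.0)
rate_without_reactivation = estimate_decay_rate([[0.0], [3.0]], 20.0)
```

In the first call the place is reactivated after 10 time units in one trace, so the rate is based on that duration; in the second call no reactivation occurs, so the maximum trace duration of 20 is used.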

IV-B Event Log Replay

After estimating $\alpha_p$ for each place in $P$, we can use the corresponding decay functions, $f_p(t)$, to obtain a decay function response for all $p \in P$ at a specific time $t$. We write $F(t)$ as the vector of decay function response values. Each element of this vector corresponds to the response value of one specific place in the PN, i.e. the $i$th element in $F(t)$ corresponds to the response of the decay function at time $t$ associated with the $i$th place in $P$.

Since $F(t)$ captures only the most recent activation of each place $p$, we introduce a counting vector $C(t)$ of size $|P|$ where the $i$th element corresponds to the $i$th place in $P$. We initialize the counting vector at time $t = 0$, $C(0)$, by setting each element to $0$. When a token enters a specific place $p_i$ at time $t$, the corresponding counter element will be incremented by $1$ such that $C(t)$ reflects the number of tokens which have entered each place from time $0$ to $t$.

Similarly, we introduce a counting vector $R(t)$ which counts the occurrence of each unique event attribute value other than $time$ from time $0$ to $t$ when replaying an event log. Continuous attribute values require discretization in advance.

We replay the event log $L$ on the PN which has been enhanced using decay functions. $F(t)$, $C(t)$, $R(t)$, and the PN marking at time $t$, $M(t)$, will be reset before a trace is replayed. We then obtain vectors and PN states at each time $t$ corresponding to the timestamp values of the replayed events in $L$. The concatenation of $F(t)$, $C(t)$, $R(t)$, and $M(t)$ is called a timed PN state sample $s(t)$,

$$s(t) = F(t) \oplus C(t) \oplus R(t) \oplus M(t) \qquad (13)$$

where $\oplus$ represents a vector concatenation. Thus, a timed PN state sample contains information about time-based token movements, i.e. when a token has entered a place the last time relative to the current time, token counts per place (loop information), and the current PN state using the marking. Optionally, if events of the event log encompass further attributes, the timed PN state sample also contains information about them.
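The concatenation that forms a timed PN state sample can be sketched as follows. This is a minimal illustration; the function name `timed_state_sample` and the use of plain Python lists are assumptions made for this example. The three place-based vectors must have one entry per place, while the attribute count vector may have any length, including zero when no further attributes are used.

```python
from typing import List, Sequence


def timed_state_sample(decay_responses: Sequence[float],
                       token_counts: Sequence[int],
                       attribute_counts: Sequence[int],
                       marking: Sequence[int]) -> List[float]:
    """Concatenate decay responses, token counts, attribute counts, and marking."""
    if not (len(decay_responses) == len(token_counts) == len(marking)):
        raise ValueError("place-based vectors must have one entry per place")
    parts = (decay_responses, token_counts, attribute_counts, marking)
    return [float(value) for part in parts for value in part]


# Toy state of a net with two places and one counted attribute value.
sample = timed_state_sample(
    decay_responses=[1.0, 0.5],
    token_counts=[1, 2],
    attribute_counts=[3],
    marking=[0, 1],
)
```

During replay, one such sample would be produced at the timestamp of every replayed event.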

After replaying the event log $L$, we obtain a set of timed PN state samples, $S$, such that Condition 14 and 15 hold. $n$ represents the number of traces in event log $L$ and $m$ represents the number of events in a specific trace.


IV-C Deep Learning

We use the set of timed PN state samples, $S$, to predict the next event activity. For each $s_{ij} \in S$ where $s_{ij}$ is the sample obtained after replaying event $e_{ij}$, we predict the corresponding next activity of the event $e_{i(j+1)}$, given that the timed PN state sample does not contain the final marking $M_f$.

This is a supervised classification problem as the event log and the set of activities are known. An event log usually consists of thousands of events across multiple traces. Hence, a deep neural network is a suitable method to conquer this problem due to the large amount of available data.

We propose two fully connected neural network architectures: one which ignores the event attribute value count vectors $R(t)$ in $s(t)$, and another one which considers each $s(t)$ as is. With DREAM-NAP, we refer to the first neural network architecture, whereas DREAM-NAPr refers to the second one considering event attributes. The details of the architecture for DREAM-NAP are illustrated in Table I, whereas the details of DREAM-NAPr are illustrated in Table II. Both architectures have been developed in Python using Keras with a TensorFlow backend.

Parameter Value
# layers 5
# neurons per layer [input, input*1.2, input*0.6, input*0.3, output]
# dropout layer 4
dropout rate 0.2
# batch normalization layers 0
activation functions [relu, relu, relu, relu, softmax]
loss categorical crossentropy
optimizer adam
TABLE I: Deep learning architecture for the DREAM-NAP model

The DREAM-NAP neural network consists of five layers. The first layer has the same size as the vector length of $s(t)$ and is correspondingly called input. The second layer has $1.2$ times, the third $0.6$ times, and the fourth $0.3$ times the size of the input layer. Each of these layers uses Rectified Linear Unit (ReLU) activation functions, which have shown major performance advantages over sigmoid and tanh [47]. The final layer is the output layer with a size equal to $|\mathcal{A}|$.

The output layer utilizes a softmax activation function since we are interested in the probability of a specific . We use dropout [48] for regularization applied between each hidden layer as well as between the fourth and the output layer. We decide on the Adam optimizer [49] to train the neural network. Batch normalization [50] layers are not used in this architecture since no further regularization is required. Moreover, batch normalization did not improve the results of the DREAM-NAP architecture, as we will demonstrate in Section V.
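The architecture in Table I can be written down as a short Keras sketch. It is an illustration under stated assumptions, not the original code: the function names are hypothetical, and rounding fractional layer sizes to the nearest integer is an assumption, since the text does not say how non-integer sizes are handled.

```python
def dream_nap_layer_sizes(input_size, num_activities):
    """Layer widths from Table I: [input, input*1.2, input*0.6, input*0.3, output]."""
    return [
        input_size,
        round(input_size * 1.2),  # rounding is an assumption
        round(input_size * 0.6),
        round(input_size * 0.3),
        num_activities,
    ]


def build_dream_nap(input_size, num_activities, dropout_rate=0.2):
    """Build the five-layer network with dropout after each non-output layer."""
    # Imported lazily so the size helper above works without TensorFlow installed.
    from tensorflow import keras

    sizes = dream_nap_layer_sizes(input_size, num_activities)
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_size,)))
    for units in sizes[:-1]:
        model.add(keras.layers.Dense(units, activation="relu"))
        model.add(keras.layers.Dropout(dropout_rate))
    model.add(keras.layers.Dense(sizes[-1], activation="softmax"))
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

For an input vector of length 100 and nine activities, `dream_nap_layer_sizes(100, 9)` yields the widths 100, 120, 60, 30, and 9.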

Parameter Value
# layers 5
# neurons per layer [250, 200, 150, 100, output]
# dropout layer 4
dropout rate 0.5
# batch normalization layers 4
activation functions [relu, relu, relu, relu, softmax]
loss categorical crossentropy
optimizer adam
TABLE II: The deep learning architecture of the DREAM-NAPr model.

The DREAM-NAPr architecture is similar to the DREAM-NAP one. However, in this architecture we use fixed layer sizes, because the number of neurons per layer can easily become very large when it depends on the actual size of the timed PN state sample, i.e. when the events in the event log encompass many resources with several unique values. Since this architecture is more likely to overfit due to the number of event attribute values, we increase the dropout rate and add batch normalization layers.
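A corresponding sketch for Table II is given below. The layer sizes, dropout rate, and layer counts come from the table; the function names and the position of each batch normalization layer (placed after the activation and before dropout) are assumptions, as the text does not specify the ordering.

```python
DREAM_NAPR_HIDDEN = [250, 200, 150, 100]  # fixed layer sizes from Table II


def dream_napr_spec(num_activities, dropout_rate=0.5):
    """Return an ordered list of (layer_type, argument) pairs describing the network."""
    spec = []
    for units in DREAM_NAPR_HIDDEN:
        spec.append(("dense_relu", units))
        spec.append(("batch_norm", None))  # assumed position: after the activation
        spec.append(("dropout", dropout_rate))
    spec.append(("dense_softmax", num_activities))
    return spec


def build_dream_napr(input_size, num_activities):
    """Translate the specification into a compiled Keras model."""
    # Imported lazily so the specification helper works without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential()
    model.add(keras.Input(shape=(input_size,)))
    for kind, arg in dream_napr_spec(num_activities):
        if kind == "dense_relu":
            model.add(keras.layers.Dense(arg, activation="relu"))
        elif kind == "batch_norm":
            model.add(keras.layers.BatchNormalization())
        elif kind == "dropout":
            model.add(keras.layers.Dropout(arg))
        else:  # "dense_softmax"
            model.add(keras.layers.Dense(arg, activation="softmax"))
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Because the hidden sizes are fixed, only the input and output dimensions change from one event log to another.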

V Evaluation

We evaluate the proposed approach using the two neural network architectures introduced in Section IV-C against the most recent state-of-the-art methods. We train and test our models on the same benchmark datasets as the literature. In this section, we first provide an overview of the datasets, followed by the introduction of metrics we will use. We then report the fitness of the obtained PNs and evaluate the actual prediction performance of the neural network models on timed PN state samples.

We performed the discovery of PNs using split miner and the transformation of event logs to timed PN state samples on a computer running Windows 10 with an Intel i7-6700 CPU and 16GB RAM. This task took between 30 minutes and 4 hours depending on the size of the dataset. The training of the DREAM-NAP and DREAM-NAPr neural networks was performed on a Tesla K80 GPU and took between 15 minutes and 2 hours per dataset.

V-A Datasets

Our evaluation is based on three real-life benchmark datasets, specifically the Helpdesk [51], the Business Process Intelligence Challenge 2012 (BPIC12) [9], and the Business Process Intelligence Challenge 2013 (BPIC13) [52] dataset.

The Helpdesk dataset comprises events from a ticketing management process of an Italian software company. Each event consists of an activity and an associated timestamp. No further event attributes are used.

The BPIC12 dataset originates from a Dutch financial institute and represents the process of a loan application. It can be split into three subprocesses related to the work, the application itself, and the offer. All events encompass the required attribute as well as resource information. Moreover, each event describes a lifecycle status which is either complete, scheduled, or start. Finally, the traces of this event log carry information about the requested loan amount. We split the dataset into multiple subprocesses in order to be able to compare our results to the results of existing methods. We consider the complete event log without any filtering, denoted by BPIC12 - all. BPIC12 - all complete considers only events of lifecycle value complete. Similarly, we filter the original event log by work related activities only and consider all events, code-named as BPIC12 - work all, and events with lifecycle attribute value complete as BPIC12 - work complete. Additionally, we consider the subprocesses of offers and applications separately as BPIC12 - O and BPIC12 - A. These subprocess event logs consist of events with complete lifecycle values only.

The third log originates from Volvo IT and describes events from an incident and problem management system. Each event is associated with a timestamp, the actual activity, a lifecycle transition, and information about the group, responsible employee, resource country, organization country, involved organizations, impact, and the product. Events associated with a problem rather than an incident encompass a further attribute which describes the role of the affected organization. We split this dataset into two separate event logs handling incidents and problems independently. We call these two event logs BPIC13 - Incidents and BPIC13 - Problems.

An overview of all datasets, including the number of events, activities, traces, and resources, is given in Table III.

Dataset # events # activities # traces # resources
Helpdesk 13,710 9 3,804 0
BPIC12 - all 262,200 24 13,087 2
BPIC12 - all complete 164,506 23 13,087 2
BPIC12 - work complete 72,413 6 9,658 2
BPIC12 - work all 170,107 7 9,658 2
BPIC12 - O 31,244 7 5,015 2
BPIC12 - A 60,849 10 13,087 2
BPIC13 - Incidents 65,533 4 7,554 7
BPIC13 - Problems 8,599 4 1,758 8
TABLE III: Number of events, activities, traces, and resources for each of the evaluated datasets.

V-B Metrics

We utilize 10-fold cross-validation to perform our evaluation. Therefore, we consider 90% of the actual traces for training and 10% for testing in each fold. The training set is used to discover a process model and to estimate the corresponding decay function parameters. These parameters are used to replay the training as well as the test set to obtain timed PN state samples. Moreover, after replaying, we split the training set into a training set and a holdout validation set. We finally obtain three disjoint datasets for training, validating, and testing a deep learning model. We train models on the training set only and select the best model based on the validation set. The best model is the one with the lowest validation loss, which is an effective and widely used approach to training neural networks called early stopping [53, 54, 55]. An overview of this procedure is visualized in Figure 1.
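The splitting and model selection procedure can be sketched as follows. The helper names, the random seed, and the holdout fraction of 10% are assumptions made for this illustration; the text does not state how large the validation set is.

```python
import random


def ten_fold_splits(trace_ids, k=10, seed=0):
    """Yield (train_ids, test_ids) pairs; each trace lands in exactly one test fold."""
    ids = list(trace_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, test


def split_holdout(items, holdout_fraction=0.1):
    """Split replayed training data into training and validation parts.

    The holdout fraction is an assumption, not a value taken from the paper.
    """
    cut = len(items) - round(len(items) * holdout_fraction)
    return items[:cut], items[cut:]


def best_epoch(validation_losses):
    """Early stopping selection: index of the epoch with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)


splits = list(ten_fold_splits(range(100)))
train_part, validation_part = split_holdout(splits[0][0])
chosen = best_epoch([0.9, 0.5, 0.4, 0.45, 0.6])  # the third epoch has the lowest loss
```

In each fold, the process model and the decay parameters would be estimated on the training traces only, so that no information from the test traces leaks into the model.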

We evaluate the performance of our approach based on averaged accuracy, precision, and recall, and compare it against the earlier-introduced next event prediction techniques. The subsequent definitions of metrics are based on [18, 56, 57].

Accuracy is defined as


where the averaging weights are given by the total number of timed PN state samples and the number of timed PN state samples whose next activity equals the i-th activity in the set of all activities. Moreover, TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.

Precision is defined as


Recall is defined as


In addition, we report the F-score for each dataset. This measure is the harmonic mean of precision and recall and provides information on how precise and robust an algorithm is.

F-Score is defined as

Fig. 2: This figure shows the interpretable PN obtained from the first training set of the Helpdesk dataset.

Finally, we report the area under the curve (AUC) of the receiver operating characteristic. It is a common classification analysis to determine which model predicts classes best. The closer an AUC value is to 1, the better the model is. Multiclass AUC is defined as


where the two rates are the true positive rate and the false positive rate of the i-th activity.
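The class-weighted averaging used for these metrics can be illustrated with a short Python sketch. It assumes that each activity's score is weighted by the share of samples whose true next activity is that activity, as the description of the weights suggests; the function name `weighted_metrics` is hypothetical, and the sketch covers precision, recall, and F-score only.

```python
def weighted_metrics(y_true, y_pred, labels):
    """Return support-weighted precision, recall, and their harmonic mean."""
    n_total = len(y_true)
    precision = 0.0
    recall = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        weight = (tp + fn) / n_total  # share of samples with this true activity
        precision += weight * (tp / (tp + fp) if tp + fp else 0.0)
        recall += weight * (tp / (tp + fn) if tp + fn else 0.0)
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


precision, recall, f_score = weighted_metrics(
    y_true=["a", "a", "b", "b"],
    y_pred=["a", "b", "b", "b"],
    labels=["a", "b"],
)
```

Under this weighting, recall reduces to the fraction of correctly predicted samples, because each activity's true positives are divided by its support and then multiplied by that same support over the total.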

We measure the quality of the obtained PNs using a basic function called token-based replay fitness. This function calculates the fitness after replaying an event log based on the number of missing, consumed, remaining, and produced tokens [35].
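As a brief illustration, the token-based replay fitness is commonly written in the process mining literature as the average of two terms, one penalizing missing tokens relative to consumed tokens and one penalizing remaining tokens relative to produced tokens. The sketch below uses that standard form; the function and argument names are chosen for this example.

```python
def token_replay_fitness(missing, consumed, remaining, produced):
    """Token-based replay fitness in [0, 1]; 1 means the log replays perfectly."""
    return 0.5 * (1 - missing / consumed) + 0.5 * (1 - remaining / produced)


perfect = token_replay_fitness(missing=0, consumed=10, remaining=0, produced=10)
imperfect = token_replay_fitness(missing=2, consumed=10, remaining=1, produced=10)
```

A replay without missing or remaining tokens yields a fitness of 1, and every missing or remaining token lowers the score.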


We do not consider the complexity and precision of a PN in this work, since we are interested in the replayability of event logs only.

In a final step, we compare the performance of our approach with that of the state-of-the-art algorithms using a rank test. In addition, we perform a sign test to determine statistically significant improvements. This test method is a variation of a binomial test and considers the number of times an algorithm performed best [58].

V-C Petri Net Discovery

We utilize split miner and perform hyperparameter optimization to obtain the best combination of its filtering and parallelism thresholds for each of the 10-fold cross-validation training sets of each dataset. Both thresholds start from a common initial value and are incremented in fixed steps. A PN is discovered for each of the hyperparameter combinations.

We select the process model with the highest fitness score for further evaluation. Table IV lists the averaged fitness values of the best models for each of the training datasets. It can be seen that split miner is able to discover PNs with average fitness values above 0.85 on every dataset. The models obtained on BPIC12 - work all, BPIC12 - A, and BPIC13 - Incidents even reach fitness scores above 0.95. This underscores that process mining techniques are able to unveil and accurately model basic behavior from event logs, which we can leverage in our approach. However, none of the process model evaluations results in a perfect fitness score of 1. This is due to the fact that split miner filters infrequent behavior, i.e. discards information which does not seem to correspond to the main process behavior.

We visualize the PN obtained from the first training set of the Helpdesk dataset in Figure 2. The white rectangles represent the unique activities observed in the event log, whereas black rectangles correspond to hidden transitions, i.e. transitions which are not mappable to activities and therefore map to a non-observable activity.

Dataset Average Fitness
Helpdesk 0.928
BPIC12 - all 0.919
BPIC12 - all complete 0.88
BPIC12 - work complete 0.892
BPIC12 - work all 0.961
BPIC12 - O 0.856
BPIC12 - A 0.951
BPIC13 - Incidents 0.955
BPIC13 - Problems 0.925
TABLE IV: This table shows the averaged cross validated fitness scores of the PN models obtained from each dataset.

V-D Preprocessing

We evaluate both deep learning architectures, DREAM-NAP and DREAM-NAPr, on all datasets except Helpdesk. The Helpdesk dataset does not contain any event resource information and is therefore evaluated with DREAM-NAP only.

All datasets originating from BPIC12 contain one continuous and one categorical resource attribute. We discretize the continuous attribute by quantizing its values into disjoint intervals of a fixed size.
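Such a quantization can be sketched in a few lines of Python. The interval size used below is purely illustrative and is not the value used in the paper; the function name `quantize` is likewise hypothetical.

```python
import math


def quantize(value, interval_size):
    """Map a continuous value to the index of its half-open interval [k*size, (k+1)*size)."""
    return math.floor(value / interval_size)


# Hypothetical interval size of 5000 for a requested loan amount.
bins = [quantize(amount, 5000) for amount in (4999.99, 5000, 12500)]
```

Each interval index can then be treated like a categorical attribute value when counting attribute occurrences.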

For the event logs originating from the BPIC13 dataset, we consider only the categorical attributes of resource country, organization country, involved organization, impact, and, if applicable, role of the affected organization. The number of unique values of the excluded resources is too large, hence these resources do not contribute beneficial and generalizable information.

We normalize each component of the timed PN state samples of the training and validation sets separately to zero mean and unit variance. The mean and standard deviation of each vector before preprocessing are used to normalize the test set.
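The normalization step can be illustrated with the following sketch. It assumes that the statistics are estimated per feature on the training samples and then reused unchanged for the test samples; the helper names are hypothetical, and constant features are left unscaled to avoid a division by zero.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Estimate the per-feature mean and standard deviation from a list of samples."""
    columns = list(zip(*rows))
    means = [fmean(column) for column in columns]
    stds = [pstdev(column) or 1.0 for column in columns]  # constant feature: scale by 1
    return means, stds


def standardize(rows, means, stds):
    """Shift and scale every feature using previously estimated statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train = [[0.0, 2.0], [2.0, 2.0]]
test = [[3.0, 2.0]]
means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)  # reuses the training statistics
```

Reusing the training statistics keeps the test set untouched by any information derived from its own distribution.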

V-E Results

Dataset                 Approach         Model       Accuracy  Precision  Recall  F-Score  AUC
Helpdesk                DREAM            DREAM-NAP   0.830     0.770      0.830   0.797    0.878
Helpdesk                DREAM            DREAM-NAPr  -         -          -       -        -
Helpdesk                Tax et al.                   0.7123    -          -       -        -
Helpdesk                Khan et al.                  0.714     -          -       -        -
Helpdesk                Mehdiyev et al.              0.782     0.632      0.781   0.711    0.762
BPIC12 - all            DREAM            DREAM-NAP   0.847     0.842      0.847   0.820    0.914
BPIC12 - all            DREAM            DREAM-NAPr  0.895     0.895      0.895   0.887    0.941
BPIC12 - all            Evermann et al.              0.860     -          -       -        -
BPIC12 - all            Khan et al.                  0.777     -          -       -        -
BPIC12 - all            Lee et al.                   0.746     -          -       -        -
BPIC12 - all complete   DREAM            DREAM-NAP   0.789     0.777      0.789   0.747    0.884
BPIC12 - all complete   DREAM            DREAM-NAPr  0.862     0.869      0.862   0.857    0.926
BPIC12 - all complete   Evermann et al.              0.788     -          -       -        -
BPIC12 - all complete   Lee et al.                   0.746     -          -       -        -
BPIC12 - work complete  DREAM            DREAM-NAP   0.746     0.723      0.746   0.706    0.821
BPIC12 - work complete  DREAM            DREAM-NAPr  0.813     0.812      0.813   0.808    0.872
BPIC12 - work complete  Tax et al.                   0.76      -          -       -        -
BPIC12 - work complete  Evermann et al.              0.658     -          -       -        -
BPIC12 - work complete  Breuker et al.               0.719     -          0.578   -        -
BPIC12 - work complete  Lee et al.                   0.780     -          -       -        -
BPIC12 - work complete  Mehdiyev et al.              0.831     0.811      0.832   -        -
BPIC12 - work all       DREAM            DREAM-NAP   0.879     0.882      0.879   0.878    0.920
BPIC12 - work all       DREAM            DREAM-NAPr  0.896     0.897      0.896   0.895    0.931
BPIC12 - work all       Evermann et al.              0.832     -          -       -        -
BPIC12 - work all       Lee et al.                   0.873     -          -       -        -
BPIC12 - O              DREAM            DREAM-NAP   0.855     0.893      0.855   0.838    0.919
BPIC12 - O              DREAM            DREAM-NAPr  0.929     0.926      0.929   0.922    0.960
BPIC12 - O              Evermann et al.              0.836     -          -       -        -
BPIC12 - O              Breuker et al.               0.811     -          0.647   -        -
BPIC12 - O              Lee et al.                   0.804     -          -       -        -
BPIC12 - O              Mehdiyev et al.              0.821     0.847      0.822   -        -
BPIC12 - A              DREAM            DREAM-NAP   0.805     0.748      0.805   0.761    0.893
BPIC12 - A              DREAM            DREAM-NAPr  0.950     0.958      0.950   0.949    0.974
BPIC12 - A              Evermann et al.              0.834     -          -       -        -
BPIC12 - A              Breuker et al.               0.801     -          0.723   -        -
BPIC12 - A              Lee et al.                   0.800     -          -       -        -
BPIC12 - A              Mehdiyev et al.              0.824     0.852      0.824   0.817    0.923
BPIC13 - Incidents      DREAM            DREAM-NAP   0.699     0.667      0.699   0.653    0.683
BPIC13 - Incidents      DREAM            DREAM-NAPr  0.883     0.879      0.883   0.879    0.878
BPIC13 - Incidents      Evermann et al.              0.735     -          -       -        -
BPIC13 - Incidents      Breuker et al.               0.714     -          0.377   -        -
BPIC13 - Incidents      Lee et al.                   0.612     -          -       -        -
BPIC13 - Incidents      Mehdiyev et al.              0.663     0.648      0.664   0.647    0.862
BPIC13 - Problems       DREAM            DREAM-NAP   0.681     0.605      0.681   0.593    0.558
BPIC13 - Problems       DREAM            DREAM-NAPr  0.780     0.779      0.780   0.761    0.749
BPIC13 - Problems       Evermann et al.              0.628     -          -       -        -
BPIC13 - Problems       Breuker et al.               0.690     -          0.521   -        -
BPIC13 - Problems       Lee et al.                   0.500     -          -       -        -
BPIC13 - Problems       Mehdiyev et al.              0.662     0.641      0.662   -        -
TABLE V: This table illustrates the results obtained by the proposed approach and contrasts them with existing state-of-the-art methods. Missing values were not reported by the corresponding authors.
Fig. 3: This figure shows the training, test, and validation accuracy and loss (y-axis) over 100 training epochs (x-axis) for each dataset without considering resources. Each plot shows the first cross-validation run representative for all ten runs.

Fig. 4: This figure shows the training, test, and validation accuracy and loss (y-axis) over 100 training epochs (x-axis) for each dataset that includes resource information. BPIC12 - work complete, BPIC12 - O, BPIC13 - Incidents, and BPIC13 - Problems start overfitting comparatively early. Each plot shows the first cross-validation run representative for all ten runs.

We train the DREAM-NAP and DREAM-NAPr models using early stopping, but continue training for 100 epochs in total for visualization purposes. The training batch sizes of DREAM-NAP and DREAM-NAPr are set independently, the latter being chosen to accelerate the training process. The detailed results are listed in Table V.

DREAM-NAP outperforms the existing methods on four out of nine benchmark datasets in terms of accuracy, precision, recall, F-score, and AUC. Especially on the Helpdesk dataset, we demonstrate that the decay mechanism in combination with token movement counts is extremely beneficial to predict the next activity: we reach an accuracy of 0.830 compared to 0.782 for the best existing method. DREAM-NAP also surpasses the existing methods on BPIC12 - all complete, BPIC12 - work all, and BPIC12 - O in terms of accuracy. However, the improvement is marginal. Our proposed DREAM-NAP architecture does not outperform all existing methods on BPIC13 - Problems, BPIC13 - Incidents, BPIC12 - A, BPIC12 - work complete, and BPIC12 - all. Generally speaking, DREAM-NAP scores average to high ranks without considering resource information. This underscores that PNs extended with decay functions and token movement counters carry important information to predict the next activity in running process cases.

The DREAM-NAPr architecture outperforms the existing methods on seven out of eight datasets which encompass event resource information. Our approach outperforms Evermann et al. in accuracy on BPIC12 - all (0.895 vs. 0.860) and BPIC12 - all complete (0.862 vs. 0.788). Similarly, our accuracy is higher than that of the current best methods on BPIC12 - work all (0.896 vs. 0.873), BPIC12 - O (0.929 vs. 0.836), BPIC12 - A (0.950 vs. 0.834), BPIC13 - Incidents (0.883 vs. 0.735), and BPIC13 - Problems (0.780 vs. 0.690). Also, the obtained metric scores for precision, recall, F-score, and AUC are higher for all of these datasets. Solely on BPIC12 - work complete, we rank second behind Mehdiyev et al., who outperform our proposed approach in accuracy (0.831 vs. 0.813). However, our precision on this dataset is slightly higher (0.812 vs. 0.811).

Overall, it can be seen that the accuracy, precision, and recall of our proposed architectures are well balanced across all benchmark datasets. Solely DREAM-NAP on Helpdesk and BPIC13 - Problems has a noticeably lower precision (0.770 and 0.605) compared to its accuracy and recall (0.830 and 0.681).

Figure 3 shows the training, test, and validation accuracy and loss over 100 epochs of the DREAM-NAP architecture for each dataset. It can be seen that none of the models tends to overfit. This confirms that batch normalization layers are not required for this neural network architecture. All models demonstrate a smooth learning curve and converge after a few training epochs.

Figure 4 visualizes the same metrics over the training epochs for the DREAM-NAPr models. In comparison to the previous figure, the models on all datasets tend to overfit early. Especially on BPIC12 - work complete, our architecture overfits, which demonstrates the importance of early stopping. It can be noted that the models which overfit the earliest and strongest are the models that perform the worst in terms of accuracy. Specifically, DREAM-NAPr on BPIC12 - work complete shows strong overfitting and ranks second, whereas on BPIC13 - Incidents and BPIC13 - Problems it demonstrates strong overfitting, resulting in the lowest outperforming accuracies of all benchmark datasets.

The diagram shown in Figure 5 indicates the superiority of our proposed architectures by reporting the average arithmetic ranks of DREAM-NAP and DREAM-NAPr alongside those of the existing methods. In comparison, the method proposed by Mehdiyev et al., which beats our approach in accuracy on one of the datasets, scores an average rank which is even lower than that of DREAM-NAP. Therefore, our algorithm demonstrates superiority by performing well across a diverse set of event logs.

Fig. 5: Arithmetic means of ranks of the state-of-the-art and proposed approaches.

We can further statistically test whether the improvements in accuracy of our proposed approach are significant. However, since none of the state-of-the-art methods has been tested on all of the datasets, we compare our architectures against the best state-of-the-art algorithm on each dataset. We use a sign test due to the small number of available samples, i.e. nine for DREAM-NAP and eight for DREAM-NAPr. We set a level of significance and adjust it using the Dunn-Sidak correction [59] to control the type I error, which yields one corrected level of significance for DREAM-NAP and one for DREAM-NAPr.

The sign test results in one p-value for DREAM-NAP and one for DREAM-NAPr. The p-value of DREAM-NAPr is below its corrected level of significance, thus DREAM-NAPr shows significant improvements in accuracy over the state-of-the-art. This underscores the superiority of our proposed method.
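The two ingredients of this test can be sketched in Python. The sketch is a generic illustration, not a reproduction of the reported analysis: the one-sided form of the sign test, the absence of tie handling, and any concrete level of significance or number of comparisons passed to these hypothetical helpers are assumptions, and the paper's exact variant may differ.

```python
from math import comb


def sidak_alpha(alpha, num_comparisons):
    """Dunn-Sidak corrected level of significance for several comparisons."""
    return 1 - (1 - alpha) ** (1 / num_comparisons)


def sign_test_p_value(wins, trials):
    """One-sided sign test: probability of at least `wins` successes in
    `trials` fair coin flips (binomial distribution with p = 0.5)."""
    return sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials


def is_significant(wins, trials, alpha, num_comparisons):
    """Compare the sign test p-value against the corrected level of significance."""
    return sign_test_p_value(wins, trials) < sidak_alpha(alpha, num_comparisons)
```

Here, `wins` would be the number of datasets on which an architecture beats the best existing method and `trials` the number of datasets it was evaluated on.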

The results show that our approach performs with consistently high metric scores across a diverse set of event logs. It demonstrates statistical superiority over the state-of-the-art methods, therefore presenting major improvements.

VI Conclusion

In this paper, we introduced a novel approach for predicting next event activities in running process cases called DREAM-NAP. Specifically, we extended the places of PN process models with decay functions to obtain timed PN state samples when replaying an event log. These timed samples are used to train a deep neural network which accurately predicts the next event activity in a running process case. Our results surpass many state-of-the-art techniques. We obtain high cross-validated accuracies and show robust, precise performances across a diverse set of real-world event logs. This underscores the feasibility and usefulness of our proposed approach.

We have shown that decay functions are useful to express a PN state during process runtime. In this way, event timestamps are not discretized, but reflected as a continuous value without loss of generality. This is important since the duration between two events might be correlated with the subsequent occurring activity.

Many recent techniques applied LSTMs to predict next event activities. However, these techniques have an underestimated disadvantage in long-running and complex process cases. LSTMs are good at memorizing recent events, but distant events vanish from the memory state vector. With our approach, we present a robust solution to overcome this memory limitation.

While most recent techniques model processes implicitly, our approach is based on explicit process models. Therefore, our method is easier to interpret than algorithms which are based only on deep learning. While the decision making of neural networks is naturally hard to understand and explain [60, 61, 62], we retain an understandable process model in combination with a simple deep learning architecture. Therefore, organizations will still be able to debug their processes using graphical representations of PNs while taking advantage of the predictive capabilities. A sensitivity analysis can be performed to interpret the decision making of the neural network which operates on top of the decay function extended PN process model.

This paper introduced a promising novel approach, thus enabling future work. First, the predictive quality might be improved by incorporating quality performance measures of process discovery algorithms apart from fitness scores. Second, we have applied the simplest kind of decay function. A comprehensive study on different decay function types might improve the predictive performance of our approach. Finally, the presented approach has been applied to next event activity prediction only, but might also be applicable to other predictive process management tasks such as remaining case time prediction, next event timestamp prediction, or anomalous process state prediction.

Appendix: Notations


null (in context of return value of a function) or non-observable activity

function returning value of attribute name of event


set of all activities

decay rate

decay rate for a specific place

constant parameter of a decay function

token counting vector from time to , each element represents the number of tokens which entered a specific place

attribute name

timestamp attribute name

set of all attribute names

function of average time between a token is consumed in place until a new token is produced in based on an input trace

time difference between current time and most recent time a token has entered place

Maximum observed trace duration in an event log

event identifier

set of all event identifiers

split miner filtering threshold hyperparameter

split miner parallelism threshold hyperparameter

set of all arcs of a PN

decay function of place

decay function response vector

false negative

false positive

false positive rate

case identifier

length of a trace

set of all traces

function which maps each event to an activity

denotes a neural network layer in form of a matrix

event log which is a set of traces

th event in the th trace of an event log

vector representing the marking of a PN

final marking

initial marking

th element of marking

vector representing the marking of a PN at time

set of all markings

number of tokens a place produces when replaying a trace


set of all places

cardinality of set of places


weighted input of neuron from previous layer

function which maps transitions to either event activities or non-observable activities

attribute value counting vector from time to

set of all of a PN

cardinality of the set

activation function based on input layer

timed PN state sample at time

set of timed PN state samples

function returning number of tokens of a place


true negative

true positive

true positive rate

set of all transitions


most recent time that a token entered place

output function of a neuron based on input layer

weight of direct connection between two neurons and

set of input nodes of a node

set of output nodes of a node

cost function

set of all non-negative integers

indices, integers, and variables used in different contexts


  • [1] C. Janiesch, A. Koschmider, M. Mecella, B. Weber, A. Burattin, C. D. Ciccio, A. Gal, U. Kannengiesser, F. Mannhardt, J. Mendling, A. Oberweis, M. Reichert, S. Rinderle-Ma, W. Song, J. Su, V. Torres, M. Weidlich, M. Weske, and L. Zhang, “The internet-of-things meets business process management: mutual benefits and challenges,” arXiv preprint arXiv:1709.03628, 2017.
  • [2] R. Seiger, U. Assmann, and S. Huber, “A case study for workflow-based automation in the internet of things,” 2018 IEEE International Conference on Software Architecture Companion (ICSA-C), pp. 11–18, 2018.
  • [3] I. Hwang and Y. J. Jang, “Process mining to discover shoppers'  pathways at a fashion retail store using a wifi-base indoor positioning system,” IEEE Transactions on Automation Science and Engineering, vol. 14, no. 4, pp. 1786–1792, Oct 2017.
  • [4] V. A. Rubin, A. A. Mitsyuk, I. A. Lomazova, and W. M. van der Aalst, “Process mining can be applied to software too!” in Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement.   ACM, 2014, p. 57.
  • [5] H. Darabi, W. L. Galanter, J. Y. Lin, U. Buy, and R. Sampath, “Modeling and integration of hospital information systems with petri nets,” in 2009 IEEE/INFORMS International Conference on Service Operations, Logistics and Informatics, July 2009, pp. 190–195.
  • [6] S. Goedertier, J. De Weerdt, D. Martens, J. Vanthienen, and B. Baesens, “Process discovery in event logs: An application in the telecom industry,” Applied Soft Computing, vol. 11, no. 2, pp. 1697–1710, 2011.
  • [7] M. Haji and H. Darabi, “Petri net based supervisory control reconfiguration of project management systems,” in 2007 IEEE International Conference on Automation Science and Engineering.   IEEE, 2007, pp. 460–465.
  • [8] N. Wightkin, U. Buy, and H. Darabi, “Formal modeling of sequential function charts with time petri nets,” IEEE Transactions on Control Systems Technology, vol. 19, no. 2, pp. 455–464, March 2011.
  • [9] B. B. Van Dongen, “Bpi challenge 2012,” https://doi.org/10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f, 2012.
  • [10] W. M. Aalst, “Transactions on petri nets and other models of concurrency ii,” K. Jensen and W. M. Aalst, Eds.   Berlin, Heidelberg: Springer-Verlag, 2009, ch. Process-Aware Information Systems: Lessons to Be Learned from Process Mining, pp. 1–26. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-00899-3_1
  • [11] I. Verenich, “A general framework for predictive business process monitoring,” in CAiSE 2016 Doctoral Consortium (CAiSE 2016), Ljubljana, Slovenia, June 2016. [Online]. Available: https://eprints.qut.edu.au/97539/
  • [12] D. Breuker, M. Matzner, P. Delfmann, and J. Becker, “Comprehensible predictive models for business processes,” MIS Q., vol. 40, no. 4, pp. 1009–1034, Dec. 2016. [Online]. Available: https://doi.org/10.25300/MISQ/2016/40.4.10
  • [13] B. Kang, D. Kim, and S.-H. Kang, “Real-time business process monitoring method for prediction of abnormal termination using knni-based lof prediction,” Expert Systems with Applications, vol. 39, no. 5, pp. 6061 – 6068, 2012.
  • [14] P. Leitner, B. Wetzstein, F. Rosenberg, A. Michlmayr, S. Dustdar, and F. Leymann, “Runtime prediction of service level agreement violations for composite services,” in Service-oriented computing. ICSOC/ServiceWave 2009 workshops.   Springer, 2009, pp. 176–186.
  • [15] N. Tax, I. Verenich, M. La Rosa, and M. Dumas, “Predictive business process monitoring with lstm neural networks,” in Advanced Information Systems Engineering, E. Dubois and K. Pohl, Eds.   Cham: Springer International Publishing, 2017, pp. 477–492.
  • [16] J. Evermann, J.-R. Rehse, and P. Fettke, “Predicting process behaviour using deep learning,” Decision Support Systems, vol. 100, pp. 129–140, 2017.
  • [17] A. Khan, H. Le, K. Do, T. Tran, A. Ghose, H. Dam, and R. Sindhgatta, “Memory-augmented neural networks for predictive process analytics,” arXiv preprint arXiv:1802.00938, 2018.
  • [18] N. Mehdiyev, J. Evermann, and P. Fettke, “A novel business process prediction model using a deep learning method,” Business & Information Systems Engineering, pp. 1–15, 2018.
  • [19] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015. [Online]. Available: http://arxiv.org/abs/1412.6572
  • [20] W. Reisig and G. Rozenberg, Eds., Lectures on Petri Nets I: Basic Models, Advances in Petri Nets, the Volumes Are Based on the Advanced Course on Petri Nets.   London, UK: Springer-Verlag, 1998.
  • [21] S. Ganguli, D. Huh, and H. Sompolinsky, “Memory traces in dynamical systems,” Proceedings of the National Academy of Sciences, vol. 105, no. 48, pp. 18 970–18 975, 2008.
  • [22] S. Fusi, P. J. Drew, and L. F. Abbott, “Cascade models of synaptically stored memories,” Neuron, vol. 45, no. 4, pp. 599–611, 2005.
  • [23] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” arXiv preprint arXiv:1410.3916, 2014.
  • [24] H. Wang, L. Grigore, U. Buy, and H. Darabi, “Enforcing transition deadlines in time petri nets,” in 2007 IEEE Conference on Emerging Technologies and Factory Automation (EFTA 2007).   IEEE, 2007, pp. 604–611.
  • [25] U. Buy, H. Darabi, M. Lehene, and V. Venepally, “Supervisory control of time petri nets using net unfolding,” in 29th Annual International Computer Software and Applications Conference (COMPSAC’05), vol. 2.   IEEE, 2005, pp. 97–100.
  • [26] M. Polato, A. Sperduti, A. Burattin, and M. de Leoni, “Time and activity sequence prediction of business process instances,” Computing, vol. 100, no. 9, pp. 1005–1031, 2018.
  • [27] N. Navarin, B. Vincenzi, M. Polato, and A. Sperduti, “Lstm networks for data-aware remaining time prediction of business process instances,” in 2017 IEEE Symposium Series on Computational Intelligence (SSCI).   IEEE, 2017, pp. 1–7.
  • [28] C. Di Francescomarino, C. Ghidini, F. M. Maggi, G. Petrucci, and A. Yeshchenko, “An eye into the future: Leveraging a-priori knowledge in predictive business process monitoring,” in Business Process Management, J. Carmona, G. Engels, and A. Kumar, Eds.   Cham: Springer International Publishing, 2017, pp. 252–268.
  • [29] M. Le, B. Gabrys, and D. Nauck, “A hybrid model for business process event prediction,” in Research and Development in Intelligent Systems XXIX, M. Bramer and M. Petridis, Eds.   London: Springer London, 2012, pp. 179–192.
  • [30] J. Becker, D. Breuker, P. Delfmann, and M. Matzner, “Designing and implementing a framework for event-based predictive modelling of business processes,” in Enterprise modelling and information systems architectures - EMISA 2014, F. Feltz, B. Mutschler, and B. Otjacques, Eds.   Bonn: Gesellschaft für Informatik e.V., 2014, pp. 71–84.
  • [31] M. Ceci, P. F. Lanotte, F. Fumarola, D. P. Cavallo, and D. Malerba, “Completion time and next activity prediction of processes using sequential pattern mining,” in Discovery Science, S. Džeroski, P. Panov, D. Kocev, and L. Todorovski, Eds.   Cham: Springer International Publishing, 2014, pp. 49–61.
  • [32] G. T. Lakshmanan, D. Shamsi, Y. N. Doganata, M. Unuvar, and R. Khalaf, “A markov prediction model for data-driven semi-structured business processes,” Knowledge and Information Systems, vol. 42, no. 1, pp. 97–126, Jan 2015. [Online]. Available: https://doi.org/10.1007/s10115-013-0697-8
  • [33] M. Unuvar, G. T. Lakshmanan, and Y. N. Doganata, “Leveraging path information to generate predictions for parallel business processes,” Knowledge and Information Systems, vol. 47, no. 2, pp. 433–461, May 2016. [Online]. Available: https://doi.org/10.1007/s10115-015-0842-7
  • [34] W. L. J. Lee, D. Parra, J. Munoz-Gama, and M. Sepulveda, “Predicting process behavior meets factorization machines,” Expert Systems with Applications, vol. 112, pp. 87–98, 2018.
  • [35] W. M. P. van der Aalst, Process Mining: Discovery, Conformance and Enhancement of Business Processes, 1st ed.   Springer Publishing Company, Incorporated, 2011.
  • [36] Q. Guo, L. Wen, J. Wang, Z. Yan, and P. S. Yu, “Mining invisible tasks in non-free-choice constructs,” in Business Process Management, H. R. Motahari-Nezhad, J. Recker, and M. Weidlich, Eds.   Cham: Springer International Publishing, 2015, pp. 109–125.
  • [37] A. Adriansyah, B. F. van Dongen, and W. M. van der Aalst, “Conformance checking using cost-based fitness analysis,” in 2011 IEEE 15th International Enterprise Distributed Object Computing Conference.   IEEE, 2011, pp. 55–64.
  • [38] W. M. P. van der Aalst, K. M. van Hee, A. H. M. ter Hofstede, N. Sidorova, H. M. W. Verbeek, M. Voorhoeve, and M. T. Wynn, “Soundness of workflow nets: classification, decidability, and analysis,” Formal Aspects of Computing, vol. 23, no. 3, pp. 333–363, May 2011. [Online]. Available: https://doi.org/10.1007/s00165-010-0161-4
  • [39] V. A. Rubin, A. A. Mitsyuk, I. A. Lomazova, and W. M. P. van der Aalst, “Process mining can be applied to software too!” in Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM ’14.   New York, NY, USA: ACM, 2014, pp. 57:1–57:8. [Online]. Available: http://doi.acm.org/10.1145/2652524.2652583
  • [40] A. Augusto, R. Conforti, M. Dumas, M. La Rosa, and A. Polyvyanyy, “Split miner: automated discovery of accurate and simple business process models from event logs,” Knowledge and Information Systems, May 2018. [Online]. Available: https://doi.org/10.1007/s10115-018-1214-x
  • [41] A. Augusto, R. Conforti, M. Dumas, M. La Rosa, F. M. Maggi, A. Marrella, M. Mecella, and A. Soo, “Automated discovery of process models from event logs: Review and benchmark,” IEEE Transactions on Knowledge and Data Engineering, 2018.
  • [42] R. Conforti, “Research code,” https://github.com/raffaeleconforti/ResearchCode, 2016.
  • [43] A. Zell, “Komponenten neuronaler modelle,” in Simulation neuronaler netze.   Addison-Wesley Bonn, 1994, ch. 6, pp. 87–95.
  • [44] I. Goodfellow, Y. Bengio, and A. Courville, “Deep feedforward networks,” in Deep Learning.   MIT Press, 2016, ch. 6, pp. 164–223.
  • [45] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
  • [46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
  • [47] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning, ser. ICML’10.   USA: Omnipress, 2010, pp. 807–814. [Online]. Available: http://dl.acm.org/citation.cfm?id=3104322.3104425
  • [48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2627435.2670313
  • [49] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [50] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448–456. [Online]. Available: http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
  • [51] M. M. Polato, “Dataset belonging to the help desk log of an Italian company,” https://doi.org/10.4121/uuid:0c60edf1-6f83-4e75-9367-4c63b3e9d5bb, 2017.
  • [52] W. Steeman, “BPI Challenge 2013,” https://doi.org/10.4121/uuid:a7ce5c55-03a7-4583-b855-98b86e1a2b07, 2013.
  • [53] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics).   Berlin, Heidelberg: Springer-Verlag, 2006.
  • [54] I. Goodfellow, Y. Bengio, and A. Courville, “Regularization for deep learning,” in Deep Learning.   MIT Press, 2016, ch. 7, p. 247.
  • [55] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
  • [56] M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,” Information Processing & Management, vol. 45, no. 4, pp. 427–437, 2009.
  • [57] A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
  • [58] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, no. Jan, pp. 1–30, 2006.
  • [59] Z. Šidák, “Rectangular confidence regions for the means of multivariate normal distributions,” Journal of the American Statistical Association, vol. 62, no. 318, pp. 626–633, 1967.
  • [60] G. Marcus, “Deep learning: A critical appraisal,” arXiv preprint arXiv:1801.00631, 2018.
  • [61] F. K. Došilović, M. Brčić, and N. Hlupić, “Explainable artificial intelligence: A survey,” in 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), May 2018, pp. 0210–0215.
  • [62] W. Samek, T. Wiegand, and K.-R. Müller, “Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models,” arXiv preprint arXiv:1708.08296, 2017.