Quartile-based Prediction of Event Types and Event Time in Business Processes using Deep Learning

02/11/2021, by Ishwar Venugopal et al., University of Essex

Deep learning models are increasingly being used for predictive process mining tasks in business processes, and modern approaches have achieved better performance than traditional ones across a range of predictive tasks. In this work, five variants of a model involving a Graph Convolutional layer and linear layers have been tested for the task of predicting the nature and timestamp of the next activity in a given process instance. We introduce a new method for representing the feature vector of any individual event in a given process instance, taking into consideration the structure of the Directly-Follows process graphs generated from the corresponding datasets. The adjacency matrix of the generated process graphs is used as input to a Graph Convolutional Network (GCN), and the different model variants use different representations of this adjacency matrix. The performance of all model variants has been tested at different stages of a process, determined by quartiles estimated from the number of events and from the case duration. The results obtained from the experiments significantly improve over previously reported results for most of the individual tasks. Interestingly, a linear Multi-Layer Perceptron (MLP) with dropout was able to outperform the GCN variants in both prediction tasks. Using a quartile-based analysis, it was further observed that, in some of the tasks where the MLP had the best overall performance, other variants performed better than the MLP at individual quartiles.


1. Introduction

Most businesses thrive on the effective use of event logs and process records. The ability to predict the nature of an unseen event in a business process can have very useful applications (Breuker et al., 2016): it can help provide more efficient customer service and facilitate the development of improved work plans for companies.

The domain of process mining combines a wide range of classical model-based predictive techniques with traditional data analysis techniques (Van Der Aalst, 2016). A process can represent any set of activities that take place in a business enterprise, such as the procedure for obtaining a financial document or the steps involved in a complaint-registering system. Business process mining, in general, deals with the analysis of the sequence of events produced during the execution of such processes (Castellanos et al., 2004; Maggi et al., 2014; Márquez-Chamorro et al., 2017). A detailed explanation of such event logs is available in the ’Supplementary materials’. Although the classical approach to depicting event logs is with the help of process graphs (Agrawal et al., 1998; Van der Aalst et al., 2003), Tax et al. (Tax et al., 2017) and Pasquadibisceglie et al. (Pasquadibisceglie et al., 2019)

have recently applied Deep Learning techniques such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNN) to the task of predictive process mining and obtained results that outperform traditional models. Inspired by these works, and taking into consideration the graph nature of processes, we aim to model event logs as graph structures and apply deep learning models to such data structures. In this work, we have used a new representation for the event log data and investigated the performance of different variants of a Graph Neural Network (GNN) along with a linear Multi-Layer Perceptron (MLP).

The rest of the paper is structured as follows: Section 2 discusses related work published in this direction. The datasets and the pre-processing techniques are introduced in Section 3, which also explains the experimental procedure followed and the different models used for the experiments. Section 4 highlights the major results obtained from the experiments, followed by a discussion in Section 5. Finally, the major conclusions drawn from this work are compiled in Section 6.

2. Related Works

Attribute | Helpdesk | BPI’12 (W) | BPI’12 (W) (no repeats)
Total number of events | 13710 | 72413 | 29410
Total process instances/cases | 3804 | 9658 | 9658
No. of unique activities/events | 9 | 6 | 6
Average case duration (in seconds) | 22474.71 | 1363.74 | 1363.74
Average no. of events per case | 3.604 | 7.498 | 3.045
Table 1. Overview of the datasets used

Business process mining generally deals with several prediction tasks, such as predicting the next activity/event (Becker et al., 2014; Tax et al., 2017; Pasquadibisceglie et al., 2019; Evermann et al., 2016; Breuker et al., 2016), the timestamp of the next event in the process (Tax et al., 2017; Van der Aalst et al., 2011), the overall outcome of a given process (Taylor, 2017), or the time remaining until the completion of a given process instance (Rogge-Solti and Weske, 2013). This work focuses on the first two of these predictive tasks, namely predicting the nature and the timestamp of the next event in a given process.

There has been a recent shift towards Deep Learning models for the task of predictive business process monitoring. Tax et al. (Tax et al., 2017) proposed a Recurrent Neural Network architecture using Long Short-Term Memory (LSTM) networks for the tasks of predicting the next activity and its timestamp, the case suffix, and the remaining cycle time. It was able to model the temporal properties of the data and improve on the results obtained from traditional process mining techniques. The main motivation for using an LSTM model was to obtain results that were consistent across a range of tasks and datasets, and the architecture could also be extended to the task of predicting the case outcome. Evermann et al. (Evermann et al., 2016) also used a Recurrent Neural Network for the task of predicting the next event. Pasquadibisceglie et al. (Pasquadibisceglie et al., 2019) used Convolutional Neural Networks (CNN) for predictive process analytics. An image-like data engineering approach was used to model the event logs and obtain results on benchmark datasets: in order to adapt a CNN to process mining tasks, a novel technique of transforming temporal data into a spatial structure similar to images was introduced. The results obtained from the CNN model improve on the accuracy scores obtained by the LSTM architecture (Tax et al., 2017) for the task of predicting the next event. In other works, there have been attempts to include features from unstructured data such as text in different Deep Learning architectures. Ding et al. (Ding et al., 2015) demonstrated how a Deep Learning model using an event-driven approach was able to provide better stock predictions by using event detection from texts. Teinemaa et al. (Teinemaa et al., 2016) also aimed to improve the performance of predictive business models by using text-mining techniques on the unstructured data present in event logs. Some of the other approaches to solving the problems of event prediction and time prediction were based on traditional process mining tools (Breuker et al., 2016; Van der Aalst et al., 2011).

Scarselli et al. (Scarselli et al., 2008) introduced the Graph Neural Network (GNN) as a new competitor to existing Deep Learning techniques, able to efficiently perform feature extraction. Graph data structures provide a systematic method to represent complex relationships within given data. Wu et al. (Wu et al., 2020) provide a comprehensive survey of GNN techniques implemented in different domains, categorising the architectures into Convolutional Graph Neural Networks (ConvGNNs), Spatio-temporal Graph Neural Networks (STGNNs), Recurrent Graph Neural Networks (RecGNNs) and Graph Autoencoders (GAEs).

Esser and Fahland (Esser and Fahland, 2019) discuss the advantages of using graph structures to model event logs. Performing process mining tasks by modelling the relationships between events and case instances as process graphs has been a widely accepted approach (Maruster et al., 2002; Van Der Aalst et al., 2007).

In this work, we aim to combine traditional process mining from event graphs with Deep Learning techniques such as Graph Convolutional Networks to achieve better performance in predictive business process monitoring. We specifically focus on the task of predicting the next activity and its timestamp in a given instance of an incomplete process.

3. Experimental Apparatus

Figure 1. Directly-Follows Graphs generated for the helpdesk dataset (left) and the BPI’12 dataset (right) using PM4Py. The nodes represent the unique Activity IDs present in the respective datasets along with their frequencies denoted in brackets. The numbers on the directed edges denote the frequency of directly-follows relations between the two nodes

In this section, we start by briefly explaining the datasets used in this work and the methodology adopted for representing the feature vectors corresponding to each row in the dataset. Following this, a mathematical formulation of graphs, and of process graphs in particular, is provided, which lays the foundation for understanding a Graph Convolutional layer. We conclude this section with a description of the procedure and metrics adopted for this work.

3.1. Datasets

The following two benchmark event log datasets have been used for this work:

3.1.1. Helpdesk dataset

(Available at https://data.mendeley.com/datasets/39bp3vv62t/1)

This dataset presents event logs obtained at the helpdesk of an Italian software company. The events in the log correspond to the activities associated with different process instances of a ticket management scenario. It is a database of 13710 events related to 3804 different process instances. Each process contains events from a list of nine unique activities. A typical process instance spans events from the insertion of a new ticket until it is closed or resolved. The average case duration, in terms of the time taken and the number of activities per case, is given in Table 1.

3.1.2. Business Process Intelligence 2012 (Sub-process W) dataset

(Available at https://github.com/verenich/ProcessSequencePrediction/blob/master/data/bpi_12_w.csv)

The original Business Process Intelligence Challenge (BPIC’12) dataset (https://www.win.tue.nl/bpi/doku.php?id=2012:challenge&redirect=1) contains event logs of a process consisting of three sub-processes and is in itself a relatively large dataset. As described in (Tax et al., 2017) and (Pasquadibisceglie et al., 2019), only completed events that are executed manually are taken into consideration for predictive analysis. This includes 72413 events from 9658 process instances. Each event in a process is one among six different unique activities involved in a single process instance. The activities denote the steps involved in the application procedure for financial needs, such as personal loans and overdrafts. We also took into consideration a different version of this dataset (https://github.com/verenich/ProcessSequencePrediction/blob/master/data/bpi_12_w_no_repeat.csv), which was used in the suffix prediction task of (Tax et al., 2017). It has fewer instances of the same event following itself in any particular case instance, as shown in Table 1. This variant will be referred to as the BPI’12 dataset with ’no repeats’ throughout this paper.

All the datasets are characterised by three columns: ’Case ID’, ’Activity ID’ and the ’Complete Timestamp’ denoting the time at which a particular event took place. A complete preliminary overview of the datasets is presented in Table 1.

3.2. Graphs

In mathematical terms, a graph can be represented as follows (Wu et al., 2020):

$G = (V, E) \qquad (1)$

where $V$ is the set of nodes and $E$ denotes the set of edges between the nodes. An edge $e_{ij} \in E$ connects node $i$ and node $j$. A graph can be either directed or undirected depending on the nature of the interaction between the nodes. In addition, a graph may be characterised by node attributes or edge attributes, which in simple terms are feature vectors associated with that particular node or edge.

Generally, the adjacency matrix of a graph is an $n \times n$ matrix with $A_{ij} = 1$ if $e_{ij} \in E$ and $A_{ij} = 0$ if $e_{ij} \notin E$, where $n$ is the number of nodes in the graph. A degree matrix is a diagonal matrix which stores the degree of each node, which numerically corresponds to the number of edges attached to that node.
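To make these definitions concrete, the short sketch below (a toy illustration, not taken from the paper's code; the edge list is hypothetical) builds the adjacency and degree matrices of a small directed graph with NumPy.

```python
import numpy as np

# Hypothetical directed graph with n = 3 nodes and edges 0->1, 0->2, 1->2, 2->0
n = 3
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]

# Adjacency matrix: A[i, j] = 1 if an edge from node i to node j exists
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = 1

# Degree matrix: a diagonal matrix storing the degree of each node,
# computed here as the row-wise sum (number of outgoing edges)
D = np.diag(A.sum(axis=1))
```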

3.2.1. Process Graphs

Process discovery from event logs can be achieved using different traditional process mining techniques. In this work, we have used an inductive mining approach with Directly-Follows Graphs (DFG) to represent the processes extracted from each of the datasets. The choice is motivated by the simplicity and efficiency with which the entire data can be represented in the form of a graph.

A Directly-Follows Graph for an Event Log L is denoted as (Van Der Aalst, 2016):

$G(L) = (A_L, \mapsto_L, A_L^{start}, A_L^{end}) \qquad (2)$

where $A_L$ is the set of activities in $L$, with $A_L^{start}$ and $A_L^{end}$ denoting the sets of start and end activities, respectively. $\mapsto_L$ denotes the directly-follows relation, which holds between two activities if and only if there is a case instance in which the source activity is directly followed by the target activity. The nodes in the graph represent the unique activities present in the event log, and a directed edge exists between two nodes if there is a directly-follows relation between them. The number of directly-follows relations that exist between two nodes is denoted by a weight on the corresponding edge.

Berti et al. (Berti et al., 2019) presented a process mining tool for the Python environment called PM4Py. The Directly-Follows Graphs for both datasets were visualised using the PM4Py package, as shown in Figure 1.
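As an illustration of how the directly-follows relation of Equation 2 can be extracted from an event log, the sketch below counts directly-follows pairs with pandas. The column names match those described in Section 3.1, but the file path is a placeholder; PM4Py provides equivalent functionality (and the visualisation shown in Figure 1) out of the box.

```python
import pandas as pd
from collections import Counter

# Placeholder path; the event logs used in this work are plain CSV files
log = pd.read_csv("event_log.csv")
log["Complete Timestamp"] = pd.to_datetime(log["Complete Timestamp"])
log = log.sort_values(["Case ID", "Complete Timestamp"])

# Count directly-follows relations: activity a directly followed by b in some case
dfg = Counter()
for _, case in log.groupby("Case ID"):
    activities = case["Activity ID"].tolist()
    for a, b in zip(activities, activities[1:]):
        dfg[(a, b)] += 1

# Each key (a, b) is an edge of the Directly-Follows Graph and the count is
# the edge weight shown in Figure 1.
print(dfg.most_common(10))
```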

Figure 2. The system architecture for the Event Predictor (above) and the Timestamp Predictor (below)

3.2.2. Graph Convolutional Layer

A graph can be described and represented by its adjacency matrix. Combining the adjacency matrix with the feature vector representation of each node gives rise to an efficient method for computing node embeddings. This operation is widely used in Graph Convolutional Networks (GCN). The convolution operation in a GCN layer can be denoted mathematically as follows (Kipf and Welling, 2016):

$H = \sigma(D^{-1} A X W) \qquad (3)$

where $X$ is the input feature matrix containing the feature vector of each node, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $W$ is the learnable weight matrix and $\sigma$ is the activation function of that layer.

Multiplying the adjacency matrix by the inverse of the degree matrix corresponds to a normalisation of the adjacency matrix. As matrix multiplication depends on the side from which a component is multiplied, some researchers prefer an alternative, symmetric normalisation (Kipf and Welling, 2016):

$H = \sigma(D^{-1/2} A D^{-1/2} X W) \qquad (4)$

A detailed explanation regarding Graph Convolutional Networks is provided in the ’Supplementary materials’.
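As a minimal sketch of the operations in Equations 3 and 4 (our own illustration, not the authors' released implementation), the adjacency matrix can be symmetrically normalised and combined with the feature matrix and a learnable weight matrix as follows.

```python
import numpy as np
import torch
import torch.nn as nn

def normalise_adjacency(A: np.ndarray) -> np.ndarray:
    """Symmetric normalisation D^{-1/2} A D^{-1/2}, as in Equation 4."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return np.diag(d_inv_sqrt) @ A @ np.diag(d_inv_sqrt)

class GraphConvLayer(nn.Module):
    """One graph convolution, sigma(A_hat X W), cf. Equation 3.
    The variants in this work use a linear (identity) activation here."""

    def __init__(self, in_features: int, out_features: int, activation=None):
        super().__init__()
        self.weight = nn.Linear(in_features, out_features, bias=False)
        self.activation = activation if activation is not None else nn.Identity()

    def forward(self, A_hat: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        # A_hat: (num_of_nodes, num_of_nodes) normalised adjacency matrix
        # X:     (..., num_of_nodes, in_features) feature matrix
        return self.activation(self.weight(A_hat @ X))
```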

3.3. Data pre-processing

The timestamp corresponding to each event in the dataset can be used to derive a feature vector representation for each row in the data. The approach introduced in (Tax et al., 2017) has been used to obtain an initial feature vector with 4 elements, corresponding to: the time since the last event in the case, the time since the case started, the time since midnight, and the day of the week on which that particular event occurred. The drawback of this representation is that it treats each event in a case independently, without considering the features of the other events that have already taken place in that particular case instance. To overcome this drawback, the feature vector of every event needs to carry a history of the other events that have already occurred for that particular Case ID. Hence, a new, more comprehensive feature vector representation was introduced.

In this work, we have used a matrix representation for each entry in the dataset. The dimensions of this matrix depend on the dataset under consideration. The number of rows corresponds to the number of unique activities/events present in that particular dataset. This number can be obtained by identifying the unique entries in the ’Activity ID’ column, or visually as the number of nodes in the process graph for each dataset (Figure 1). Let us denote this value by ’num_of_nodes’ for ease of representation. As can be observed from Table 1, num_of_nodes is 9 for the Helpdesk dataset and 6 for both versions of the BPI’12 (W) dataset. The number of columns corresponds to the length of the initial feature vector, i.e. 4. This results in a matrix of size ’num_of_nodes x 4’ for each data entry.

The advantage of such a representation is that the feature matrix of each data entry can now store the features corresponding to the other events that have occurred in the same case instance. The first row of the feature matrix stores the 4-element feature vector of the event with Activity ID equal to 1 in that particular Case ID, the second row stores the features of the event with Activity ID equal to 2, and so on. Hence, each row index stores the 4-element feature vector corresponding to the Activity ID denoted by that index. One approximation used in this step is that, if an event corresponding to a particular Activity ID has occurred more than once in a case instance, only the last occurrence of that event is used. In scenarios where an event with a particular Activity ID has not occurred in a given case instance, the corresponding row of the feature matrix stores a vector of zeroes.
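The sketch below is our reconstruction of the described procedure (the helper names, and the assumption that Activity IDs are consecutive integers starting from 1, are ours): it computes the four timestamp-derived features of (Tax et al., 2017) for each event and assembles the num_of_nodes x 4 feature matrix for a (partial) case, keeping only the last occurrence of each Activity ID.

```python
import numpy as np
import pandas as pd

def event_features(ts, case_start, prev_ts):
    """4-element feature vector of (Tax et al., 2017) for a single event."""
    midnight = ts.normalize()
    return [
        (ts - prev_ts).total_seconds(),      # time since last event in the case
        (ts - case_start).total_seconds(),   # time since case start
        (ts - midnight).total_seconds(),     # time since midnight
        ts.dayofweek,                        # day of the week
    ]

def feature_matrix(case_events: pd.DataFrame, num_of_nodes: int) -> np.ndarray:
    """num_of_nodes x 4 feature matrix for the events seen so far in one case.

    Row k-1 holds the features of the most recent event with Activity ID k;
    rows for Activity IDs that have not occurred yet remain zero."""
    X = np.zeros((num_of_nodes, 4))
    timestamps = case_events["Complete Timestamp"]
    case_start = timestamps.iloc[0]
    prev_ts = case_start
    for _, row in case_events.iterrows():
        ts = row["Complete Timestamp"]
        X[int(row["Activity ID"]) - 1] = event_features(ts, case_start, prev_ts)
        prev_ts = ts
    return X
```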

This method of representation gives each row a 9 x 4 matrix for the Helpdesk dataset and a 6 x 4 matrix for the BPI’12 (W) dataset. The motivation behind choosing such a representation is to facilitate the computation involved in a Graph Convolutional layer, as explained in Section 3.2.2. To understand this further, consider the binary adjacency matrix B for the process graph generated from the BPI’12 (W) dataset.

It is a 6 x 6 matrix which needs to be normalised, as per Equation 4, before being used in a Graph Convolutional layer. The degree matrix D for this binary adjacency matrix is a diagonal matrix whose entries correspond to the number of edges connected to the node represented by that position; numerically, each entry can be computed as the row-wise sum of the adjacency matrix. The normalised version of B is then obtained by multiplying with the inverse of D as per Equation 4.

As this normalised matrix is again a 6 x 6 matrix, the dimensions of the feature matrix (6 x 4 for the BPI’12 (W) dataset) make it compatible for matrix multiplication in the GCN layer. In general, the normalised adjacency matrix has dimensions num_of_nodes x num_of_nodes and the input feature matrix has dimensions num_of_nodes x 4.

3.4. Procedure

The network depicted in Figure 2 shows the architecture of the model that learns the next Activity ID and the timestamp of the next activity. The overall structure constructed for this work consists mainly of a Graph Convolutional layer followed by a Sequential block of three linear layers with Dropout. We have introduced five different variants of this general architecture for the experiments carried out in this work; each of these variants is explained in the subsequent sub-sections.

3.4.1. GCN with Weighted Adjacency Matrix

The adjacency matrix of the process graph depicted in Figure 1 is computed. Rather than following the traditional approach of using binary entries, in this variant the adjacency matrix stores the values corresponding to the weighted edges of the process graph. For instance, each entry of the weighted adjacency matrix for the BPI’12 (W) dataset is the number of directly-follows relations between the corresponding pair of activities in Figure 1, and the weighted adjacency matrix for the BPI’12 (W) [no repeats] dataset can be computed in the same way.

By comparing the two weighted adjacency matrices, we can observe that this representation captures the fact that the BPI’12 (W) [no repeats] dataset has fewer instances of an event being followed by itself in a given case instance. A normalisation procedure similar to the one explained in the previous section is then applied to this adjacency matrix in the GCN layer.
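Such a weighted adjacency matrix can be built directly from the directly-follows counts computed in Section 3.2.1; the short sketch below is our illustration (the dfg counter is the one from the earlier example, and Activity IDs are again assumed to be consecutive integers starting from 1).

```python
import numpy as np

def weighted_adjacency(dfg, num_of_nodes: int) -> np.ndarray:
    """W[i-1, j-1] = number of directly-follows relations from activity i to j."""
    W = np.zeros((num_of_nodes, num_of_nodes))
    for (a, b), count in dfg.items():
        W[int(a) - 1, int(b) - 1] = count
    return W

# The binary adjacency matrix of Section 3.3 is simply the indicator of W:
# B = (W > 0).astype(float)
```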

3.4.2. GCN with Binary Adjacency matrix

This variant uses the binary adjacency matrix B introduced in the previous section. The degree matrix is computed, after which a symmetrically normalised adjacency matrix is obtained.

3.4.3. GCN with Laplacian Transform of Weighted Adjacency Matrix

The Laplacian matrix of a graph is defined as (Godsil and Royle, 2013):

$L = D - A \qquad (5)$

where $D$ is the degree matrix and $A$ is the adjacency matrix. For example, the Laplacian matrix corresponding to the weighted adjacency matrix of the BPI’12 (W) dataset can be computed in this way.

In this variant, this Laplacian matrix is then used for all computations involved within the Graph Convolutional layer.
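Equation 5 translates directly into code; a minimal sketch (assuming a binary or weighted adjacency matrix computed as above) is:

```python
import numpy as np

def laplacian(A: np.ndarray) -> np.ndarray:
    """Graph Laplacian L = D - A (Equation 5)."""
    D = np.diag(A.sum(axis=1))
    return D - A
```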

3.4.4. GCN with Laplacian Transform of Binary Adjacency Matrix

This variant differs from the previous one only in that it uses the binary adjacency matrix B instead of the weighted adjacency matrix. The Laplacian matrix for the BPI’12 (W) dataset is then computed from its binary adjacency matrix in the same manner.

3.4.5. Multi-Layer Perceptron (MLP)

In order to understand whether the GCN layer added any significant change to the performance, we used a variant with only the linear layers (omitting the GCN layer). The feature matrix was flattened and given as input to the linear layers. As in the other variants, Dropout is used before the last layer. Hence, the dimension of the flattened input vector to the MLP is num_of_nodes x 4 (number of nodes times number of features).

All variants excluding the MLP take as input the corresponding normalised adjacency matrix of the process graph and the input feature matrix for the Graph Convolutional layer. The GCN layer returns an embedding for each of the nodes after a linear activation. This is then fed into a Sequential network with three linear layers, which uses Dropout. In the case of the Multi-Layer Perceptron (MLP), the input is provided directly to the Sequential layers.

The Event Predictor network uses tanh activation for the first two linear layers and is trained with the cross-entropy loss. The Timestamp Predictor network, on the other hand, uses ReLU activation for the first two layers and is trained with the Mean Absolute Error as the loss function. An Adam optimizer (Kingma and Ba, 2014) is used for both training processes for all variants. Each of the datasets is divided into train and test sets with a split ratio of 33%, with 20% of the training data being used as a validation set during the training process.
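A compact sketch of how the two predictors described above could be assembled is given below. It reuses the hypothetical GraphConvLayer sketched in Section 3.2.2 and should be read only as an illustration of the described architecture; the hidden-layer width, dropout rate and output dimensions are placeholders not specified in the text.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """GCN layer followed by a Sequential block of three linear layers with
    Dropout (Figure 2). For the Event Predictor the hidden activation is tanh
    and the loss is cross-entropy; for the Timestamp Predictor the hidden
    activation is ReLU and the loss is the Mean Absolute Error (L1)."""

    def __init__(self, num_of_nodes, num_features, out_dim,
                 hidden=64, activation=nn.Tanh(), use_gcn=True):
        super().__init__()
        self.use_gcn = use_gcn
        self.gcn = GraphConvLayer(num_features, num_features) if use_gcn else None
        self.mlp = nn.Sequential(
            nn.Linear(num_of_nodes * num_features, hidden), activation,
            nn.Linear(hidden, hidden), activation,
            nn.Dropout(p=0.2),                      # Dropout before the last layer
            nn.Linear(hidden, out_dim),
        )

    def forward(self, A_hat, X):
        # For the MLP variant (use_gcn=False) the feature matrix is used directly
        H = self.gcn(A_hat, X) if self.use_gcn else X
        return self.mlp(H.flatten(start_dim=-2))    # flatten node embeddings

# Hypothetical configuration for the BPI'12 (W) dataset (6+1 event classes):
# event_model = Predictor(6, 4, out_dim=7, activation=nn.Tanh())
# time_model  = Predictor(6, 4, out_dim=1, activation=nn.ReLU())
# event_loss, time_loss = nn.CrossEntropyLoss(), nn.L1Loss()
# optimiser = torch.optim.Adam(event_model.parameters(), lr=1e-4)
```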

3.5. Metrics

Each row is associated with two labels, the next activity and the time (in seconds) after which the next event in that case takes place. As in (Tax et al., 2017), an additional label to denote the end of a case is added. But unlike (Tax et al., 2017) and other works, we have not divided the dataset chronologically. Even though the same split ratio (2/3 training data and 1/3 test data) has been maintained, a random split method is adopted for better generalisation.

The quality of the next-activity prediction is measured in terms of the accuracy of predicting the correct label. For timestamp prediction, we have used the Mean Absolute Error (MAE), calculated in days, for comparison with the results reported in previous works.

We have evaluated the performance of each variant at different quartiles. The quartiles for each case instance have been computed in two different ways: based on the number of events and based on the case duration.

In order to compute the quartiles based on the number of events, the total number of events in each case instance is first calculated and divided into four equal parts. Each event is then assigned a quartile number (1, 2, 3 or 4) depending on which portion it belongs to. This is numerically equivalent to the following: first compute the ratio between the number of events from the beginning of the case up to that particular event and the total number of events in that case, then scale this ratio to four.

The computation based on case duration differs in that each quartile may not contain an equal number of data points for a given case instance. The division into quartiles in this scenario is based on dividing the total case duration into four equal parts, so that each quartile corresponds to a specific time window after the beginning of a given case instance. Each individual event in the case is then assigned a quartile number depending on the timestamp at which that particular event occurred after the beginning of that case. This is numerically equivalent to first finding the ratio between the time since the beginning of the case for that particular event and the total case duration, and then scaling this ratio to four.
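The two quartile assignments described above reduce to scaling a ratio to four; the sketch below is our interpretation of this procedure and assigns each event a quartile number between 1 and 4.

```python
import math

def quartile_by_events(event_position: int, total_events: int) -> int:
    """Quartile based on the number of events; event_position is 1-based."""
    ratio = event_position / total_events
    return min(4, max(1, math.ceil(ratio * 4)))

def quartile_by_duration(time_since_case_start: float, case_duration: float) -> int:
    """Quartile based on the case duration (both arguments in seconds)."""
    if case_duration <= 0:
        return 1
    ratio = time_since_case_start / case_duration
    return min(4, max(1, math.ceil(ratio * 4)))
```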

The main motivation behind investigating the performance of each of the model variants at different quartiles was to identify if there were any interesting patterns for the predictions at different stages of a process. For the events in the last quartile, the event-predictor also takes into consideration predicting the end of the case as it falls among one of the classes. This approach would give us a generic overview of the performance of each model variant.

4. Results

Accuracy for Event Prediction; quartiles based on the number of events

Dataset | Model | Q1 | Q2 | Q3 | Q4 | All
Helpdesk | GCN (Weighted) | 0.733 | 0.7006 | 0.7774 | 0.9596 | 0.8086
Helpdesk | GCN (Binary) | 0.7287 | 0.6867 | 0.7798 | 0.9391 | 0.798
Helpdesk | GCN (Laplacian On Binary) | 0.733 | 0.702 | 0.8145 | 0.9343 | 0.811
Helpdesk | GCN (Laplacian On Weighted) | 0.733 | 0.705 | 0.8065 | 0.9446 | 0.813
Helpdesk | MLP | 0.733 | 0.7079 | 0.8056 | 0.963 | 0.8196
BPI’12 (W) | GCN (Weighted) | 0.7354 | 0.7912 | 0.8236 | 0.4062 | 0.6671
BPI’12 (W) | GCN (Binary) | 0.7203 | 0.7852 | 0.8206 | 0.4251 | 0.6677
BPI’12 (W) | GCN (Laplacian On Binary) | 0.7364 | 0.7886 | 0.818 | 0.4073 | 0.6657
BPI’12 (W) | GCN (Laplacian On Weighted) | 0.612 | 0.735 | 0.8024 | 0.5415 | 0.6649
BPI’12 (W) | MLP | 0.7347 | 0.7716 | 0.818 | 0.4548 | 0.6758
BPI’12 (W) [no repeats] | GCN (Weighted) | 0.5681 | 0.6158 | 0.4408 | 0.567 | 0.5544
BPI’12 (W) [no repeats] | GCN (Binary) | 0.976 | 0.7158 | 0.356 | 0.4741 | 0.5729
BPI’12 (W) [no repeats] | GCN (Laplacian On Binary) | 0.975 | 0.7271 | 0.4016 | 0.448 | 0.5756
BPI’12 (W) [no repeats] | GCN (Laplacian On Weighted) | 0.9654 | 0.6617 | 0.3361 | 0.5224 | 0.5707
BPI’12 (W) [no repeats] | MLP | 0.9875 | 0.8054 | 0.4515 | 0.4992 | 0.6302
Table 2. Comparison of the accuracy values for next-event prediction, obtained from all model variants at different stages of a process (indicated by quartiles estimated based on the number of events)
MAE (in days) for Time Prediction; quartiles based on the number of events

Dataset | Model | Q1 | Q2 | Q3 | Q4 | All
Helpdesk | GCN (Weighted) | 0.3114 | 0.3408 | 0.3739 | 0.077 | 0.2617
Helpdesk | GCN (Binary) | 0.3217 | 0.363 | 0.3848 | 0.0692 | 0.2699
Helpdesk | GCN (Laplacian On Binary) | 0.3656 | 0.3651 | 0.3629 | 0.0788 | 0.2721
Helpdesk | GCN (Laplacian On Weighted) | 0.3566 | 0.371 | 0.3857 | 0.0692 | 0.2761
Helpdesk | MLP | 0.1506 | 0.1615 | 0.214 | 0.0358 | 0.1342
BPI’12 (W) | GCN (Weighted) | 0.3616 | 0.3697 | 0.3751 | 0.3402 | 0.36
BPI’12 (W) | GCN (Binary) | 0.3542 | 0.3612 | 0.3665 | 0.3545 | 0.3589
BPI’12 (W) | GCN (Laplacian On Binary) | 0.3752 | 0.3677 | 0.3663 | 0.3399 | 0.3603
BPI’12 (W) | GCN (Laplacian On Weighted) | 0.369 | 0.3617 | 0.361 | 0.3413 | 0.3567
BPI’12 (W) | MLP | 0.2933 | 0.3091 | 0.3155 | 0.3332 | 0.3148
BPI’12 (W) [no repeats] | GCN (Weighted) | 0.443 | 0.4411 | 0.4961 | 0.1514 | 0.3399
BPI’12 (W) [no repeats] | GCN (Binary) | 0.3629 | 0.4027 | 0.5018 | 0.1923 | 0.3373
BPI’12 (W) [no repeats] | GCN (Laplacian On Binary) | 0.4113 | 0.419 | 0.4884 | 0.1643 | 0.3335
BPI’12 (W) [no repeats] | GCN (Laplacian On Weighted) | 0.4443 | 0.4399 | 0.4798 | 0.16 | 0.3395
BPI’12 (W) [no repeats] | MLP | 0.3528 | 0.3505 | 0.3863 | 0.1887 | 0.2952
Table 3. Comparison of the MAE values (in days) for predicting the timestamp of the next event, obtained from all model variants at different stages of a process (indicated by quartiles estimated based on the number of events)
Accuracy for Event Prediction; quartiles based on the case duration

Dataset | Model | Q1 | Q2 | Q3 | Q4 | All
Helpdesk | GCN (Weighted) | 0.7599 | 0.552 | 0.6188 | 0.9119 | 0.8086
Helpdesk | GCN (Binary) | 0.7512 | 0.552 | 0.5856 | 0.8999 | 0.798
Helpdesk | GCN (Laplacian On Binary) | 0.7685 | 0.6108 | 0.6188 | 0.901 | 0.811
Helpdesk | GCN (Laplacian On Weighted) | 0.7698 | 0.5973 | 0.6298 | 0.9046 | 0.813
Helpdesk | MLP | 0.7739 | 0.5973 | 0.6298 | 0.9156 | 0.8196
BPI’12 (W) | GCN (Weighted) | 0.7616 | 0.9079 | 0.8139 | 0.4358 | 0.6671
BPI’12 (W) | GCN (Binary) | 0.75 | 0.909 | 0.8112 | 0.4497 | 0.6677
BPI’12 (W) | GCN (Laplacian On Binary) | 0.7639 | 0.9004 | 0.8053 | 0.4354 | 0.6657
BPI’12 (W) | GCN (Laplacian On Weighted) | 0.6718 | 0.8967 | 0.7864 | 0.5337 | 0.6649
BPI’12 (W) | MLP | 0.7632 | 0.9018 | 0.7968 | 0.4664 | 0.6758
BPI’12 (W) [no repeats] | GCN (Weighted) | 0.5226 | 0.469 | 0.685 | 0.5766 | 0.5544
BPI’12 (W) [no repeats] | GCN (Binary) | 0.719 | 0.4069 | 0.4814 | 0.4657 | 0.5729
BPI’12 (W) [no repeats] | GCN (Laplacian On Binary) | 0.7106 | 0.4754 | 0.5982 | 0.4575 | 0.5756
BPI’12 (W) [no repeats] | GCN (Laplacian On Weighted) | 0.6991 | 0.2912 | 0.469 | 0.493 | 0.5707
BPI’12 (W) [no repeats] | MLP | 0.7462 | 0.5889 | 0.731 | 0.5138 | 0.6302
Table 4. Comparison of the accuracy values for next-event prediction, obtained from all model variants at different stages of a process (indicated by quartiles estimated based on the time duration)
MAE (in days) for Time Prediction; quartiles based on the case duration

Dataset | Model | Q1 | Q2 | Q3 | Q4 | All
Helpdesk | GCN (Weighted) | 0.3437 | 0.3814 | 0.3842 | 0.1422 | 0.2617
Helpdesk | GCN (Binary) | 0.3599 | 0.3939 | 0.3846 | 0.1415 | 0.2699
Helpdesk | GCN (Laplacian On Binary) | 0.3709 | 0.3717 | 0.364 | 0.1385 | 0.2721
Helpdesk | GCN (Laplacian On Weighted) | 0.3785 | 0.3848 | 0.3851 | 0.1358 | 0.2761
Helpdesk | MLP | 0.177 | 0.2234 | 0.2204 | 0.0666 | 0.1342
BPI’12 (W) | GCN (Weighted) | 0.3588 | 0.3622 | 0.3928 | 0.3479 | 0.3601
BPI’12 (W) | GCN (Binary) | 0.3498 | 0.358 | 0.3817 | 0.3596 | 0.3589
BPI’12 (W) | GCN (Laplacian On Binary) | 0.3705 | 0.3551 | 0.3809 | 0.3438 | 0.3603
BPI’12 (W) | GCN (Laplacian On Weighted) | 0.3645 | 0.3528 | 0.3763 | 0.3426 | 0.3567
BPI’12 (W) | MLP | 0.3018 | 0.3239 | 0.3275 | 0.32 | 0.3148
BPI’12 (W) [no repeats] | GCN (Weighted) | 0.4394 | 0.5087 | 0.5131 | 0.2079 | 0.3399
BPI’12 (W) [no repeats] | GCN (Binary) | 0.3709 | 0.5815 | 0.5754 | 0.2507 | 0.3373
BPI’12 (W) [no repeats] | GCN (Laplacian On Binary) | 0.4091 | 0.528 | 0.5302 | 0.2181 | 0.3335
BPI’12 (W) [no repeats] | GCN (Laplacian On Weighted) | 0.4432 | 0.4782 | 0.4812 | 0.2108 | 0.3395
BPI’12 (W) [no repeats] | MLP | 0.3495 | 0.3808 | 0.3799 | 0.2251 | 0.2952
Table 5. Comparison of the MAE values (in days) for predicting the timestamp of the next event, obtained from all model variants at different stages of a process (indicated by quartiles estimated based on the time duration)

4.1. Helpdesk Dataset

The Helpdesk dataset is a relatively small dataset on which the Event Predictor predicts the next activity as one of 10 classes. The model corresponding to the best validation loss is saved for each model variant and then evaluated on the same test set. To understand the quality of predictions at different stages of the process, we have also measured the accuracy and Mean Absolute Error (MAE) for different quartiles of the process execution. As explained in the previous section, quartiles can be computed in two ways: based on the number of events and based on the case duration.

Each of the model variants was initially run with different learning rates for the Adam optimizer. The accuracy and MAE values corresponding to the different values of this hyper-parameter, along with the training plots, are provided in the ’Supplementary materials’. For all the GCN variants, the best performance for the Timestamp Predictor was obtained with a learning rate of 0.001. For the Event Predictor, the GCN variant with the weighted adjacency matrix gave the best performance at a learning rate of 0.001, and all other GCN variants performed best at 0.0001. For the MLP model, both tasks gave the best results at a learning rate of 0.0001.

The accuracy values achieved for the event prediction task are presented in Table 4 and Table 2. A similar procedure is applied for the timestamp prediction task; the Mean Absolute Error (in days) achieved on the test set by the models saved for the different variants is shown in Table 5 and Table 3.

All the models perform very well on both tasks for this dataset. It can be observed from Table 5 and Table 3 that the MLP model outperforms all other variants for the time prediction task, in all individual quartiles as well as in the overall Mean Absolute Error. Considering the quartiles based on time duration, it can be observed that, even though the overall results for the Event Predictor are of similar quality, the GCN variant with the Laplacian transform of the binary adjacency matrix outperforms the MLP in the second quartile, and the GCN variant with the Laplacian transform of the weighted adjacency matrix matches the performance of the MLP in the third quartile. In the case of quartiles based on the number of events, we observe a slightly different trend. Almost all the variants perform equally well in the first quartile, and the MLP model achieves the best performance in all individual quartiles except the third, where the GCN variant with the Laplacian transform of the binary adjacency matrix outperforms all other variants.

For all entries irrespective of their quartiles, a maximum accuracy of 81.96% is obtained for the Event Predictor and the Timestamp Predictor achieves a minimum MAE of 0.1342 days.

4.2. BPI’12 dataset

A procedure identical to the one used for the Helpdesk dataset has been followed in evaluating the next-activity prediction and timestamp prediction tasks for the BPI’12 (W) dataset, and the computation of quartiles is also the same as before. The accuracy values for event prediction on the test set using the different model variants are reported in Table 4 and Table 2. Similarly, the MAE values (in days) for the time prediction task are presented in Table 5 and Table 3.

All the model variants were run with different initial learning rates for the Adam optimizer, and the models with the best results were used for further experiments. All the values have been tabulated and plotted in the ’Supplementary materials’. The Timestamp Predictor for all variants gave the best results with a learning rate of 0.0001. This is also the preferred learning rate for the Event Predictor in all GCN variants, except the one using the Laplacian transform of the binary adjacency matrix, where it is 0.001. The Event Predictor for the MLP model gives the best results at a learning rate of 0.00001.

As in the case of the Helpdesk dataset, the MLP model outperforms all other variants in the time prediction task, achieving an overall minimum MAE of 0.3148 days. We observe slight variations in the results of the Event Predictor. Focusing first on quartiles computed according to the number of events (Table 2), we can observe that even though the MLP achieves the highest overall accuracy across all quartiles, it fails to outperform the GCN variants in individual quartiles; the GCN variant with the weighted adjacency matrix achieves the best performance in the second and third quartiles. A similar trend can be observed in the Event Predictor results of Table 4: the variants which achieve the best performance in the individual quartiles remain the same, except for the second quartile. A maximum accuracy of 67.58% is achieved for the event prediction task on data entries from all quartiles.

4.3. BPI’12 (W, no repeats) dataset

This version of the BPI’12 (W) dataset differs from the previous one in that it has fewer instances of the same event being repeated within a given case instance. This is reflected in the comparison of the weighted adjacency matrices discussed in Section 3.4.1.

The optimal learning rate for the Adam optimizer differs between variants and tasks. The GCN variant with the weighted adjacency matrix and the MLP model give the best results with a learning rate of 0.0001. The GCN variant with the binary adjacency matrix and the one using the Laplacian transform of the binary adjacency matrix give the best results at a learning rate of 0.0001 for the Event Predictor and 0.001 for the Timestamp Predictor. For the GCN variant with the Laplacian transform of the weighted adjacency matrix, the optimal learning rate is 0.00001 for the Event Predictor and 0.0001 for the Timestamp Predictor. All the training plots and performance values are presented in the ’Supplementary materials’.

It can be observed from Table 4 and Table 2 that there is no improvement in event prediction accuracy over the BPI’12 (W) dataset. Unlike on the BPI’12 (W) dataset, however, the MLP interestingly achieves the best results in all individual quartiles except the last one; the GCN variant with the weighted adjacency matrix outperforms all other variants in the last quartile for both methods of quartile estimation. The MAE for the time prediction task, on the other hand, improves over the BPI’12 (W) dataset, as seen in Table 5 and Table 3. Moreover, the GCN variant with the weighted adjacency matrix outperforms the MLP model in the last quartile for the Timestamp Predictor as well. Hence, it can be deduced that the GCN variant with the weighted adjacency matrix performs best in the last quartile for both tasks on the BPI’12 (W) [no repeats] dataset.

The maximum accuracy achieved by the Event Predictor is 63.02% and the minimum Mean Absolute Error obtained by the Timestamp Predictor is 0.2952 days.

5. Discussion and Lessons Learned

Model | Accuracy: Helpdesk | Accuracy: BPI’12 (W) | Accuracy: BPI’12 (W) [no repeats] | MAE (days): Helpdesk | MAE (days): BPI’12 (W) | MAE (days): BPI’12 (W) [no repeats]
GCN (Weighted) | 0.8086 | 0.6671 | 0.5544 | 0.2617 | 0.3601 | 0.3399
GCN (Binary) | 0.798 | 0.6677 | 0.5729 | 0.2699 | 0.3589 | 0.3373
GCN (Laplacian On Weighted) | 0.813 | 0.6649 | 0.5707 | 0.2761 | 0.3567 | 0.3395
GCN (Laplacian On Binary) | 0.811 | 0.6657 | 0.5756 | 0.2721 | 0.3603 | 0.3335
MLP | 0.8196 | 0.6758 | 0.6302 | 0.1342 | 0.3148 | 0.2952
N Tax et al. (Tax et al., 2017) | 0.7123 | 0.76 | - | 3.75 | 1.56 | -
V Pasquadibisceglie et al. (Pasquadibisceglie et al., 2019) | 0.7393 | 0.7817 | - | - | - | -
Evermann et al. (Evermann et al., 2016) | - | 0.623 | - | - | - | -
Breuker et al. (Breuker et al., 2016) | - | 0.719 | - | - | - | -
WMP Van der Aalst et al. (Van der Aalst et al., 2011) | - | - | - | 5.67 | 1.91 | -
Table 6. Comparison of the results obtained from all the model variants, with other reported results on the same benchmark datasets

As mentioned in Section 2, the task of event prediction and the timestamp prediction has been explored in various other works as well, using other techniques. Table 6 compiles the best results reported in other works and compares them with the results obtained from our approach.

Tax et al. (Tax et al., 2017) performed both of these tasks on the same datasets and hence provide a strong baseline. Pasquadibisceglie et al. (Pasquadibisceglie et al., 2019) performed next-activity prediction on both of these datasets. Evermann et al. (Evermann et al., 2016) and Breuker et al. (Breuker et al., 2016) performed the next-activity prediction task on the BPI’12 (W) dataset. Van der Aalst et al. (Van der Aalst et al., 2011) performed the time prediction task on both these datasets. The BPI’12 (W) [no repeats] dataset has not been used for these specific tasks in any of the previous works.

The best results for the event prediction task and the timestamp prediction task, as reported in the respective papers, are compiled in Table 6, alongside the results obtained using the different model variants proposed in this work.

It can be observed from Table 6 that all the model variants used in this work clearly outperform previous models on the time prediction task, by a relatively large margin. For the event prediction task, the results are mixed. On the Helpdesk dataset, all the model variants outperform both the LSTM model (Tax et al., 2017) and the Convolutional Neural Network model (Pasquadibisceglie et al., 2019), but they fail to improve on the accuracy obtained on the BPI’12 (W) dataset. The BPI’12 (W) [no repeats] version was used in this work to investigate this aspect of the model, but it can be seen that reducing the number of occurrences of repeated events does not improve the prediction accuracy for the next event.

Even though the feature vector representation used in this work was motivated by the requirements of a Graph Convolutional layer, the results show that the GCN layer does not add much improvement over a linear neural network that takes the raw feature vectors as input. As seen in the results presented here, the MLP model is the variant that achieves the best overall results among the five variants used in this work, for all the tasks. One possible explanation for this behaviour is that the improved feature vector representation, compared to other works such as the LSTM model (Tax et al., 2017), already accounts for much of the performance of the deep learning models. Also, considering that the number of classes in the event prediction task is not very high for any of the datasets (9+1 classes for the Helpdesk dataset and 6+1 classes for the BPI’12 dataset), the simple MLP model may be sufficiently effective at learning correlations between the input features and the target labels. In future work, it would be interesting to test these models on datasets with a larger number of classes for the event prediction task.

A potential threat to the validity of these results comes from one of the assumptions made during the pre-processing stage: when there are recurring events in a particular case, only the last occurrence of an event is used. As can be seen from the time prediction results, this assumption does not appear to have harmed the prediction of the timestamp of the next event. For the event prediction task, the accuracy values on the Helpdesk dataset are also high. In spite of these results, it would be interesting to investigate other variations of the feature vector representation as well.

6. Conclusions

In general, our experiments show that a natural way to represent an event log is as a graph structure with a feature vector corresponding to each node. The 9 x 4 matrix representation for the Helpdesk dataset and the 6 x 4 matrix representation for the BPI’12 (W) dataset are in accordance with this graph-structure representation. A Multi-Layer Perceptron (MLP) taking inputs represented in this form performs much better than the other models. This high performance of the MLP model is surprising, and further studies are needed to explain this behaviour.

There are several directions for future work inspired by these models. A more efficient and intelligent way of dealing with recurring events in a given case instance could be developed. Another interesting aspect would be to include unstructured data, such as text, in the feature vectors of each node; this could be done by combining techniques from text analytics with the feature vector representation introduced here.

Appendix A Supplementary Material

a.1. Event Logs

A typical event log gives an organised list of different case instances for a particular type of process. All the distinct case instances can be identified with the ’Case ID’ identifier. Each row in the dataset can then be interpreted as an individual event that takes place in any of these case instances. Each distinct event has an associated ’Activity ID’ label. An event log will also store the timestamp at which a particular event occurs, which is denoted by the ’Complete Timestamp’. Hence, the event logs used in this work have three columns as shown in Figure 3: Case ID, Activity ID and the Complete Timestamp.

Figure 3. The first five entries in the event logs for the Helpdesk dataset (left), BPI’12 (W) dataset in the middle and BPI’12 (W,no repeats) dataset on the right

A process can represent any sequence of events that can take place in a business firm. As a simple dummy example, consider the imaginary process depicted in Figure 4.

Figure 4. An example for a business process

In Figure 4, the numbers denoted in the brackets for each of the events can be interpreted as the Activity ID. So different case instances for this kind of process can have various forms, some of which have been given in Table 7.

Variant 1 | Variant 2 | Variant 3
(1) Register Complaint | (1) Register Complaint | (1) Register Complaint
(2) Examine Complaint | (2) Examine Complaint | (2) Examine Complaint
(3) Provide debugging instructions | (5) Assign Technician | (5) Assign Technician
(4) Complaint Resolved | (6) Check if resolved | (6) Check if resolved
  | (4) Complaint Resolved | (5) Assign Technician
  |   | (6) Check if resolved
  |   | (4) Complaint Resolved
Table 7. Possible variants of case instances for the dummy example

Each of the variants depicted in Table 7 can be interpreted as distinct Case IDs. The datasets used in this work also correspond to similar processes taking place in their respective domains. But unlike the dummy example, we are not able to directly produce a process graph as shown in Figure 4. Hence, we have to use the concept of Directly-follows graphs (Equation 2) to obtain process graphs (Figure 1).

a.2. Graph Convolutional Networks (GCN)

The Graph Convolutional layer used in this work is a basic GCN model, which performs the classical convolution operation on a graph (Equation 3). There have been many recent advancements towards more sophisticated GCN architectures, which fall into two major categories: spectral-based (Shuman et al., 2013) and spatial-based (Micheli, 2009) approaches.

In general, a graph convolution operation is similar to a traditional convolution operation on images; the major difference is that GCNs deal with irregular data structures represented in the form of graphs. The information stored in a graph can be systematically represented by its adjacency matrix. Much real-world data can be stored as graphs. For instance, the Cora dataset (Rossi and Ahmed, 2015) (http://networkrepository.com/cora.php) is a citation network consisting of scientific publications sorted into seven different classes. Data from social networks and communication networks (https://snap.stanford.edu/data/) are other examples of data that can be stored in the form of graph structures.

In this work, the data flow through the Graph Convolutional layer can be represented as shown in Figure 5. The output from the GCN layer is fed into a Sequential block consisting of three linear layers with dropout. The adjacency matrix is derived from the process graphs, and the feature matrix is computed for each row in a given dataset.

Figure 5. Flow-diagram for the GCN layer used in this work

a.3. Training Process for GCN Model (Weighted Adjacency Matrix)

The training process for the variant involving a Graph Convolutional layer with the weighted adjacency matrix is depicted in Figure 6 for different learning rates. These learning rates are those of the Adam optimizer used for training all the model variants in this work. As seen in Table 8, learning rates of 0.001 and 0.0001 produced the best results for the Helpdesk dataset and the BPI’12 datasets, respectively.

Learning Rate | Helpdesk: Accuracy | Helpdesk: MAE (days) | BPI’12 (W): Accuracy | BPI’12 (W): MAE (days) | BPI’12 (W) [no repeats]: Accuracy | BPI’12 (W) [no repeats]: MAE (days)
0.001 | 0.8086 | 0.2617 | 0.6138 | 0.3699 | 0.5185 | 0.3402
0.0001 | 0.7973 | 0.3031 | 0.6671 | 0.36 | 0.5544 | 0.3399
0.00001 | 0.7637 | 0.3323 | 0.6597 | 0.3726 | 0.5118 | 0.3466
Table 8. Accuracy values for the next event prediction and Mean Absolute Error (in days) for the timestamp prediction, on the three different datasets. The values correspond to the GCN model with weighted adjacency matrix.
Figure 6. The training process for the GCN model using weighted adjacency matrix showing: the cross-entropy loss for the Event Predictor and the Mean Absolute Error (in days) for the Timestamp Predictor, at different learning rates on the validation set.

a.4. Training Process for GCN Model (Binary Adjacency Matrix)

The training process for the variant using the binary adjacency matrix is shown in Figure 7. The best-performing model for each of the tasks on the three different datasets can be identified from Table 9.

Learning Rate | Helpdesk: Accuracy | Helpdesk: MAE (days) | BPI’12 (W): Accuracy | BPI’12 (W): MAE (days) | BPI’12 (W) [no repeats]: Accuracy | BPI’12 (W) [no repeats]: MAE (days)
0.001 | 0.7743 | 0.2699 | 0.6631 | 0.3614 | 0.4532 | 0.3373
0.0001 | 0.798 | 0.3147 | 0.6677 | 0.3589 | 0.5729 | 0.3635
0.00001 | 0.7743 | 0.323 | 0.6503 | 0.3826 | 0.5479 | 0.3457
Table 9. Accuracy values for the next event prediction and Mean Absolute Error (in days) for the timestamp prediction, on the three different datasets. The values correspond to the GCN model with binary adjacency matrix.
Figure 7. The training process for the GCN model using binary adjacency matrix showing: the cross-entropy loss for the Event Predictor and the Mean Absolute Error (in days) for the Timestamp Predictor, at different learning rates on the validation set.

a.5. Training Process for GCN Model (Laplacian on Binary)

This variant uses the Laplacian transform of the binary adjacency matrix obtained from each of the datasets. It can be observed from Table 10 that the best models for each of the tasks correspond to learning rates of 0.001 or 0.0001. The training plots for each of the learning rates presented in Table 10, have been shown in Figure 8.

Learning Rate | Helpdesk: Accuracy | Helpdesk: MAE (days) | BPI’12 (W): Accuracy | BPI’12 (W): MAE (days) | BPI’12 (W) [no repeats]: Accuracy | BPI’12 (W) [no repeats]: MAE (days)
0.001 | 0.7922 | 0.2721 | 0.6657 | 0.3629 | 0.4655 | 0.3335
0.0001 | 0.811 | 0.2863 | 0.6527 | 0.3603 | 0.5756 | 0.3388
0.00001 | 0.7748 | 0.2993 | 0.6622 | 0.3737 | 0.5232 | 0.3662
Table 10. Accuracy values for the next event prediction and Mean Absolute Error (in days) for the timestamp prediction, on the three different datasets. The values correspond to the GCN model with Laplacian transform of binary adjacency matrix.
Figure 8. The training process for the GCN model using Laplacian transform of Binary adjacency matrix showing: the cross-entropy loss for the Event Predictor and the Mean Absolute Error (in days) for the Timestamp Predictor, at different learning rates on the validation set.

a.6. Training Process for GCN Model (Laplacian on Weighted)

This variant corresponds to the model involving computations in the Graph Convolutional layer using the Laplacian transform of the weighted adjacency matrix. Figure 9 shows the training plots and Table 11 compiles the accuracy and the mean absolute error for the event predictor and the timestamp predictor at different learning rates.

Learning Rate | Helpdesk: Accuracy | Helpdesk: MAE (days) | BPI’12 (W): Accuracy | BPI’12 (W): MAE (days) | BPI’12 (W) [no repeats]: Accuracy | BPI’12 (W) [no repeats]: MAE (days)
0.001 | 0.7807 | 0.2761 | 0.6597 | 0.3644 | 0.512 | 0.3418
0.0001 | 0.813 | 0.2929 | 0.6649 | 0.3567 | 0.5343 | 0.3395
0.00001 | 0.767 | 0.3275 | 0.6638 | 0.3672 | 0.5707 | 0.3461
Table 11. Accuracy values for the next event prediction and Mean Absolute Error (in days) for the timestamp prediction, on the three different datasets. The values correspond to the GCN model with Laplacian transform of weighted adjacency matrix.
Figure 9. The training process for the GCN model using Laplacian transform of weighted adjacency matrix showing: the cross-entropy loss for the Event Predictor and the Mean Absolute Error (in days) for the Timestamp Predictor, at different learning rates on the validation set.

a.7. Training Process for MLP model

As observed in the results of the experiments, the Multi-Layer Perceptron (MLP) model gave the best performance for most of the tasks. The training plots corresponding to different learning rates for the MLP model are shown in Figure 10, and Table 12 lists all the accuracy and Mean Absolute Error values.

Learning Rate | Helpdesk: Accuracy | Helpdesk: MAE (days) | BPI’12 (W): Accuracy | BPI’12 (W): MAE (days) | BPI’12 (W) [no repeats]: Accuracy | BPI’12 (W) [no repeats]: MAE (days)
0.001 | 0.8137 | 0.1402 | 0.6477 | 0.3247 | 0.5959 | 0.3006
0.0001 | 0.8196 | 0.1342 | 0.6757 | 0.3148 | 0.6302 | 0.2952
0.00001 | 0.8223 | 0.1374 | 0.6758 | 0.3196 | 0.6269 | 0.3016
Table 12. Accuracy values for the next event prediction and Mean Absolute Error (in days) for the timestamp prediction, on the three different datasets. The values correspond to the MLP model.
Figure 10. The training process for the MLP model showing: the cross-entropy loss for the Event Predictor and the Mean Absolute Error (in days) for the Timestamp Predictor, at different learning rates on the validation set.

a.8. Comparison of training plots for all model variants

All the model variants were tested with different initial learning rates for the Adam optimizer. For each variant, the training plots for models with the best performance have been plotted in Figure 11. All the values in the plot denote the scores obtained on the validation set during the training process. It is clear from these plots that the MLP model has the training curves with the best loss for most of the cases, as reflected in the results presented in Section 4.

Figure 11. Comparison of the learning process with the different model variants used in this work, for both the event predictor as well as the timestamp predictor

a.9. Source Code

The source code for all the experiments carried out in this work is available at https://github.com/ishwarvenugopal/gcn-process-mining.

References

  • R. Agrawal, D. Gunopulos, and F. Leymann (1998) Mining process models from workflow logs. In International Conference on Extending Database Technology, pp. 467–483. Cited by: §1.
  • J. Becker, D. Breuker, P. Delfmann, and M. Matzner (2014) Designing and implementing a framework for event-based predictive modelling of business processes. Enterprise modelling and information systems architectures-EMISA 2014. Cited by: §2.
  • A. Berti, S. J. van Zelst, and W. M. P. van der Aalst (2019) Process mining for Python (PM4Py): bridging the gap between process- and data science. CoRR abs/1905.06169. Cited by: §3.2.1.
  • D. Breuker, M. Matzner, P. Delfmann, and J. Becker (2016) Comprehensible predictive models for business processes. MIS Q. 40 (4), pp. 1009–1034. Cited by: §1, §2, §2, Table 6, §5.
  • M. Castellanos, F. Casati, U. Dayal, and M. Shan (2004) A comprehensive and automated approach to intelligent business processes execution analysis. Distributed and Parallel Databases 16 (3), pp. 239–273. Cited by: §1.
  • X. Ding, Y. Zhang, T. Liu, and J. Duan (2015) Deep learning for event-driven stock prediction. In Twenty-Fourth International Joint Conference on Artificial Intelligence. Cited by: §2.
  • S. Esser and D. Fahland (2019) Using graph data structures for event logs. (English). Note: Report of a Capita Selecta research project. External Links: Document Cited by: §2.
  • J. Evermann, J. Rehse, and P. Fettke (2016) A deep learning approach for predicting process behaviour at runtime. In International Conference on Business Process Management, pp. 327–338. Cited by: §2, §2, Table 6, §5.
  • C. Godsil and G. F. Royle (2013) Algebraic graph theory. Vol. 207, Springer Science & Business Media. Cited by: §3.4.3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.5.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.2.2, §3.2.2.
  • F. M. Maggi, C. Di Francescomarino, M. Dumas, and C. Ghidini (2014) Predictive monitoring of business processes. In International conference on advanced information systems engineering, pp. 457–472. Cited by: §1.
  • A. E. Márquez-Chamorro, M. Resinas, and A. Ruiz-Cortes (2017) Predictive monitoring of business processes: a survey. IEEE Transactions on Services Computing 11 (6), pp. 962–977. Cited by: §1.
  • L. Maruster, A. T. Weijters, W. W. Van der Aalst, and A. van den Bosch (2002) Process mining: discovering direct successors in process logs. In International Conference on Discovery Science, pp. 364–373. Cited by: §2.
  • A. Micheli (2009) Neural network for graphs: a contextual constructive approach. IEEE Transactions on Neural Networks 20 (3), pp. 498–511. Cited by: §A.2.
  • V. Pasquadibisceglie, A. Appice, G. Castellano, and D. Malerba (2019) Using convolutional neural networks for predictive process analytics. In 2019 International Conference on Process Mining (ICPM), pp. 129–136. Cited by: §1, §2, §2, §3.1.2, Table 6, §5, §5.
  • A. Rogge-Solti and M. Weske (2013) Prediction of remaining service execution time using stochastic petri nets with arbitrary firing delays. In International conference on service-oriented computing, pp. 389–403. Cited by: §2.
  • R. A. Rossi and N. K. Ahmed (2015) The network data repository with interactive graph analytics and visualization. In AAAI. Cited by: §A.2.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §2.
  • D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30 (3), pp. 83–98. Cited by: §A.2.
  • N. Tax, I. Verenich, M. La Rosa, and M. Dumas (2017) Predictive business process monitoring with lstm neural networks. In International Conference on Advanced Information Systems Engineering, pp. 477–492. Cited by: §1, §2, §2, §3.1.2, §3.3, §3.5, Table 6, §5, §5, §5.
  • P. N. Taylor (2017) Customer contact journey prediction. In International Conference on Innovative Techniques and Applications of Artificial Intelligence, pp. 278–290. Cited by: §2.
  • I. Teinemaa, M. Dumas, F. M. Maggi, and C. Di Francescomarino (2016) Predictive business process monitoring with structured and unstructured data. In International Conference on Business Process Management, pp. 401–417. Cited by: §2.
  • W. M. Van Der Aalst, H. A. Reijers, A. J. Weijters, B. F. van Dongen, A. A. De Medeiros, M. Song, and H. Verbeek (2007) Business process mining: an industrial application. Information Systems 32 (5), pp. 713–732. Cited by: §2.
  • W. M. Van der Aalst, M. H. Schonenberg, and M. Song (2011) Time prediction based on process mining. Information systems 36 (2), pp. 450–475. Cited by: §2, §2, Table 6, §5.
  • W. M. Van der Aalst, B. F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A. J. Weijters (2003) Workflow mining: a survey of issues and approaches. Data & Knowledge Engineering 47 (2), pp. 237–267. Cited by: §1.
  • W. Van Der Aalst (2016) Data science in action. In Process mining, pp. 222–240. Cited by: §1, §3.2.1.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §2, §3.2.