gcnprocessmining
Source code for arXiv paper (https://arxiv.org/abs/2102.07838)
Deep learning models are increasingly being used for predictive process mining tasks in business processes. Modern approaches have achieved better performance on different predictive tasks than traditional approaches. In this work, five variants of a model involving a Graph Convolutional layer and linear layers have been tested on the tasks of predicting the nature and the timestamp of the next activity in a given process instance. We introduce a new method for representing the feature vector of an individual event in a given process instance, taking into consideration the structure of the Directly-Follows process graphs generated from the corresponding datasets. The adjacency matrix of the generated process graphs is used as input to a Graph Convolutional Network (GCN), and the different model variants make use of variations in the representation of this adjacency matrix. The performance of all the model variants has been tested at different stages of a process, determined by quartiles estimated based on the number of events and on the case duration. The results obtained from the experiments significantly improve over previously reported results for most of the individual tasks. Interestingly, a linear Multi-Layer Perceptron (MLP) with dropout was able to outperform the GCN variants on both prediction tasks. Using a quartile-based analysis, it was further observed that the other variants were able to perform better than the MLP at individual quartiles in some of the tasks where the MLP had the best overall performance.
Most businesses thrive on the effective use of event logs and process records. The ability to predict the nature of an unseen event in a business process can have very useful applications (Breuker et al., 2016), helping to provide more efficient customer service and facilitating the development of improved workplans for companies.
The domain of process mining combines a wide range of classical model-based predictive techniques with traditional data analysis techniques (Van Der Aalst, 2016). A process can be a representation of any set of activities that take place in a business enterprise, for example the procedure for obtaining a financial document or the steps involved in a complaint-registering system. Business process mining, in general, deals with the analysis of the sequences of events produced during the execution of such processes (Castellanos et al., 2004; Maggi et al., 2014; Márquez-Chamorro et al., 2017). A detailed explanation of such event logs is available in the 'Supplementary materials'. Even though the classical approach to depicting event logs is with the help of process graphs (Agrawal et al., 1998; Van der Aalst et al., 2003), Tax et al. (Tax et al., 2017) and Pasquadibisceglie et al. (Pasquadibisceglie et al., 2019)
have recently applied deep learning techniques such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNN) to the task of predictive process mining and obtained results that outperform traditional models. Inspired by these works, and taking into consideration the graph nature of processes, we aim to model event logs as graph structures and apply various deep learning models to such data structures. In this work, we use a new representation for the event log data and investigate the performance of different variants of a Graph Neural Network (GNN) along with a linear Multi-Layer Perceptron (MLP).
The rest of the paper is structured as follows: Section 2 discusses related work published in this direction. The datasets and the preprocessing techniques are introduced in Section 3, which also explains the experimental procedure that was followed and the different models used for the experiments. Section 4 highlights the major results obtained from the experiments, followed by a discussion in Section 5. Finally, the major conclusions drawn from this work are compiled in Section 6.
Table 1. Overview of the datasets.

| Attribute | Helpdesk | BPI'12 (W) | BPI'12 (W) [no repeats] |
| --- | --- | --- | --- |
| Total number of events | 13710 | 72413 | 29410 |
| Total process instances/cases | 3804 | 9658 | 9658 |
| No. of unique activities/events | 9 | 6 | 6 |
| Average case duration (in seconds) | 22474.71 | 1363.74 | 1363.74 |
| Average no. of events per case | 3.604 | 7.498 | 3.045 |
Business process mining generally deals with several prediction tasks, such as predicting the next activity/event (Becker et al., 2014; Tax et al., 2017; Pasquadibisceglie et al., 2019; Evermann et al., 2016; Breuker et al., 2016), the timestamp of the next event in the process (Tax et al., 2017; Van der Aalst et al., 2011), the overall outcome of a given process (Taylor, 2017) or the time remaining until the completion of a given process instance (Rogge-Solti and Weske, 2013). This work focuses on the first two of these tasks, namely predicting the nature and the timestamp of the next event in a given process.
There has been a recent shift towards deep learning models for the task of predictive business process monitoring. Tax et al. (Tax et al., 2017) proposed a Recurrent Neural Network architecture using Long Short-Term Memory (LSTM) networks for predicting the next activity and timestamp, the suffix length and the remaining cycle time. It was able to model the temporal properties of the data and improve on the results obtained from traditional process mining techniques. The main motivation for using an LSTM model was to obtain results that were consistent across a range of tasks and datasets. The LSTM architecture that was introduced could also be extended to the task of predicting the case outcome. Evermann et al. (Evermann et al., 2016) had also attempted to use a Recurrent Neural Network for the task of predicting the next event. Pasquadibisceglie et al. (Pasquadibisceglie et al., 2019) used Convolutional Neural Networks (CNN) for predictive process analytics. An image-like data engineering approach was used to model the event logs and obtain results on benchmark datasets. In order to adapt a CNN for process mining tasks, a novel technique of transforming temporal data into a spatial structure similar to images was introduced. The results obtained from the CNN model improve over the accuracy scores obtained by the LSTM architecture (Tax et al., 2017) for the task of predicting the next event. In other works, there have been attempts to include features from unstructured data such as texts in different deep learning architectures. Ding et al. (Ding et al., 2015) demonstrated how a deep learning model using an event-driven approach was able to provide better stock predictions by using event detection from texts. Teinemaa et al. (Teinemaa et al., 2016) also aimed to improve the performance of predictive business models by using text-mining techniques on the unstructured data present in event logs. Some of the other approaches to this problem of event and time prediction were based on traditional process mining tools (Breuker et al., 2016; Van der Aalst et al., 2011). Scarselli et al. (Scarselli et al., 2008)
introduced the Graph Neural Network (GNN) as a new competitor among deep learning techniques that could efficiently perform feature extraction. Graph data structures provide a systematic method to represent complex relationships within given data. Wu et al. (Wu et al., 2020) provide a comprehensive survey of GNN techniques implemented in different domains, categorizing the architectures into Convolutional Graph Neural Networks (ConvGNNs), Spatio-temporal Graph Neural Networks (STGNNs), Recurrent Graph Neural Networks (RecGNNs) and Graph Autoencoders (GAEs).
Esser et al. (Esser and Fahland, 2019) discuss the advantages of using graph structures to model event logs. Performing process mining tasks by modelling the relationships between events and case instances as process graphs has been a widely accepted approach (Maruster et al., 2002; Van Der Aalst et al., 2007).
In this work, we aim to combine traditional process mining from event graphs along with Deep Learning techniques like Graph Convolutional Networks to achieve a better performance in predictive business process monitoring. We specifically focus on the task of predicting the next activity and its timestamp in a given instance of an incomplete process.
In this section, we start by briefly explaining the datasets used in this work and the methodology adopted for representing the feature vectors corresponding to each row in the dataset. Following this, a mathematical formulation of graphs and of the specific case of process graphs is provided, which lays the foundation for understanding a Graph Convolutional layer. We conclude the section with a description of the procedure and metrics adopted for this work.
The following two benchmark event log datasets have been used for this work:
This dataset presents event logs recorded at the helpdesk of an Italian software company. The events in the log correspond to the activities associated with different process instances of a ticket management scenario. It is a database of 13710 events related to 3804 different process instances. Each process contains events from a list of nine unique activities. A typical process instance spans events from the insertion of a new ticket until it is closed or resolved. The average case duration, in terms of both the time taken and the number of activities per case, is reported in Table 1.
The original Business Process Intelligence Challenge (BPIC'12) dataset (https://www.win.tue.nl/bpi/doku.php?id=2012:challenge) contains event logs of a process consisting of three subprocesses and is in itself a relatively large dataset. As described in (Tax et al., 2017) and (Pasquadibisceglie et al., 2019), only completed events that are executed manually are taken into consideration for predictive analysis. This leaves 72413 events from 9658 process instances. Each event in a process is one among six unique activities involved in a single process instance. The activities denote the steps involved in an application procedure for financial needs, such as personal loans and overdrafts. We also took into consideration a different version of this dataset (https://github.com/verenich/ProcessSequencePrediction/blob/master/data/bpi_12_w_no_repeat.csv), which was used in the suffix prediction task of (Tax et al., 2017). It has reduced instances of the same event following itself in any particular case instance, as shown in Table 1. This variant is referred to as the BPI'12 dataset with "no repeats" throughout this paper.
All the datasets are characterised by three columns: 'Case ID', 'Activity ID' and 'Complete Timestamp', the last denoting the time at which a particular event took place. A preliminary overview of the datasets is presented in Table 1.
In mathematical terms, a graph can be represented as follows (Wu et al., 2020):

G = (V, E)    (1)

where V is the set of nodes and E the set of edges. An edge e_ij ∈ E connects node i and node j. A graph can be either directed or undirected depending on the nature of the interaction between the nodes. In addition, a graph may be characterised by node attributes or edge attributes, which in simple terms are feature vectors associated with a particular node or edge.
Generally, the adjacency matrix of a graph is an n x n matrix with A_ij = 1 if e_ij ∈ E and A_ij = 0 if e_ij ∉ E, where n is the number of nodes in the graph. The degree matrix is a diagonal matrix which stores the degree of each node, i.e. the number of edges attached to that node.
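As an illustration, the adjacency and degree matrices of a small directed graph can be computed as follows (a NumPy sketch; the edge list is hypothetical):

```python
import numpy as np

# Hypothetical directed edge list (i, j) for a 4-node graph
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
n = 4

# Adjacency matrix: A[i, j] = 1 if the edge e_ij exists, 0 otherwise
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = 1

# Degree matrix: diagonal entries hold each node's (out-)degree,
# computed here as a row-wise sum of A
D = np.diag(A.sum(axis=1))
```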
Process discovery from event logs can be achieved using different traditional process mining techniques. In this work, we have used an inductive mining approach with Directly-Follows Graphs (DFG) to represent the processes extracted from each of the datasets. The choice is motivated by the simplicity and efficiency with which the entire data can be represented in the form of a graph.
A Directly-Follows Graph for an event log L is denoted as (Van Der Aalst, 2016):

G = (A_L, ↦_L, A_start, A_end)    (2)

where A_L is the set of activities in L, with A_start and A_end denoting the sets of start and end activities, respectively. ↦_L denotes the directly-follows relation, which holds between two activities if and only if there is a case instance in which the source activity is directly followed by the target activity. The nodes of the graph represent the unique activities present in the event log, and a directed edge exists between two nodes if a directly-follows relation holds between them. The number of directly-follows relations observed between two nodes is denoted by the weight of the corresponding edge.
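The weighted edges of such a graph can be counted directly from an event log; a sketch using a toy log with hypothetical activity names:

```python
from collections import Counter

def directly_follows(log):
    """Count directly-follows pairs in an event log.

    `log` maps each case ID to its ordered list of activities.
    Returns edge weights: (source, target) -> count."""
    edges = Counter()
    for trace in log.values():
        for src, tgt in zip(trace, trace[1:]):
            edges[(src, tgt)] += 1
    return edges

# Toy log with two cases (hypothetical activity names)
log = {"c1": ["A", "B", "C"], "c2": ["A", "B", "B", "C"]}
weights = directly_follows(log)
```

An edge such as ("B", "B") with a non-zero count corresponds to an event directly following itself in a case, the pattern that the "no repeats" variant of BPI'12 (W) reduces.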
Berti et al. (Berti et al., 2019) presented a process-mining tool for the Python environment called PM4Py. The Directly-Follows Graphs for the datasets were visualised using the PM4Py package, as shown in Figure 1.
A graph can be described and represented by its adjacency matrix. Combining the adjacency matrix with the feature vector representations of each node gives rise to an efficient method to compute node embeddings. This operation is widely used in Graph Convolutional Networks (GCN). The convolution operation in a GCN layer can be mathematically denoted as follows (Kipf and Welling, 2016):

H = σ(D^{-1} A X W)    (3)

where X is the input feature matrix containing the feature vector of each node, A is the adjacency matrix of the graph, D is the degree matrix, W is the learnable weight matrix and σ is the activation function of that layer.
Multiplying the inverse of the degree matrix with the adjacency matrix corresponds to a normalisation of the adjacency matrix. As matrix multiplication is influenced by the side from which a matrix is multiplied, some researchers prefer an alternate, symmetric normalisation (Kipf and Welling, 2016):

H = σ(D^{-1/2} A D^{-1/2} X W)    (4)
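Both normalisations can be sketched in a few lines of NumPy (the 2x2 matrix is a hypothetical example; the helper assumes every node has at least one edge, so no degree is zero):

```python
import numpy as np

def normalise(A, symmetric=True):
    """Normalise an adjacency matrix for use in a GCN layer.

    symmetric=False applies the row normalisation D^-1 A (Eq. 3);
    symmetric=True applies D^-1/2 A D^-1/2 (Eq. 4)."""
    d = A.sum(axis=1)  # degrees as row-wise sums
    if symmetric:
        d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        return d_inv_sqrt @ A @ d_inv_sqrt
    return np.diag(1.0 / d) @ A

A = np.array([[0.0, 1.0], [1.0, 1.0]])
A_row = normalise(A, symmetric=False)   # each row now sums to 1
A_sym = normalise(A, symmetric=True)
```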
A detailed explanation regarding Graph Convolutional Networks is provided in the ’Supplementary materials’.
The timestamp corresponding to each event in the dataset can be used to derive a feature vector representation for each row in the data. The approach introduced in (Tax et al., 2017) has been used to initially obtain a feature vector with four elements, corresponding to: the time since the last event in the case, the time since the case started, the time since midnight and the day of the week on which that particular event occurred. This results in a 4-element feature vector for every row in the dataset. The drawback of this kind of representation is that it treats each event in a case independently, without considering the features of the other events that have taken place in that particular case instance. To overcome this drawback, the feature vector of every event needs to carry a history of the other events that have already occurred for that particular Case ID. Hence, a new, more comprehensive feature vector representation was introduced.
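A sketch of how these four features might be derived from raw timestamps (the function and its arguments are illustrative, not the authors' code):

```python
from datetime import datetime

def event_features(ts, case_start, prev_ts):
    """Return the 4-element feature vector for one event:
    [time since last event, time since case start,
     time since midnight, day of the week]."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return [
        (ts - prev_ts).total_seconds(),
        (ts - case_start).total_seconds(),
        (ts - midnight).total_seconds(),
        ts.weekday(),  # Monday = 0, ..., Sunday = 6
    ]

start = datetime(2021, 2, 15, 9, 0, 0)   # hypothetical case start (a Monday)
feats = event_features(datetime(2021, 2, 15, 10, 30, 0), start, start)
```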
In this work, we have used a matrix representation for each entry in the dataset. The dimensions of this matrix depend on the dataset considered. The number of rows corresponds to the number of unique activities/events present in that particular dataset. This number can be obtained by identifying the unique entries in the 'Activity ID' column, or visually as the number of nodes in the process graph for each of the datasets (Figure 1). Let us denote this value by 'num_of_nodes' for ease of representation. As can be observed from Table 1, num_of_nodes is 9 for the Helpdesk dataset and 6 for both versions of the BPI'12 (W) dataset. The number of columns corresponds to the length of the initial feature vector, i.e. 4. This results in a matrix of size 'num_of_nodes x 4' for each data entry.
The advantage of such a representation is that the feature matrix of each data entry can now store the features corresponding to the other events that have occurred in the same case instance. The first row of the feature matrix stores the 4-element feature vector of the event with Activity ID 1 in that particular Case ID, the second row stores the features of the event with Activity ID 2, and so on. Hence, each row index stores the 4-element feature vector corresponding to the Activity ID denoted by that index. One approximation used in this step is that if an event corresponding to a particular Activity ID has occurred more than once in a case instance, only the last occurrence of that event is used. In scenarios where an event with a particular Activity ID has not occurred in a given case instance, the feature matrix stores a vector of zeroes for that Activity ID.
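This construction can be sketched as follows (the helper name and the toy feature values are hypothetical):

```python
import numpy as np

def case_feature_matrix(events, num_nodes):
    """Build the (num_nodes x 4) feature matrix for one case.

    `events` is a chronological list of (activity_id, feature_vector)
    pairs with 1-based activity IDs. Later occurrences of an activity
    overwrite earlier ones, matching the last-occurrence approximation;
    unseen activities keep a zero row."""
    M = np.zeros((num_nodes, 4))
    for activity_id, feats in events:
        M[activity_id - 1] = feats
    return M

# Toy case: activity 1 occurs twice, activity 3 once, others never
events = [(1, [0, 0, 100, 2]), (3, [50, 50, 150, 2]), (1, [80, 130, 230, 2])]
M = case_feature_matrix(events, num_nodes=6)
```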
This method of representation gives each row a 9 x 4 matrix for the Helpdesk dataset and a 6 x 4 matrix for the BPI'12 (W) dataset. The motivation behind choosing such a representation is to facilitate the computation involved in a Graph Convolutional layer, as explained in Section 3.2.2. To understand this further, consider the binary adjacency matrix B of the process graph generated from the BPI'12 (W) dataset.
It is a 6 x 6 matrix which needs to be normalised as per Equation 4 before being used in a Graph Convolutional layer. The degree matrix for this binary adjacency matrix is a diagonal matrix whose entries correspond to the number of edges connected to the node represented by each position. Numerically, this can be computed as a row-wise sum of the matrix.
The normalised version of B, obtained from the degree matrix D as per Equation 4, is still a 6 x 6 matrix, so the dimensions of the feature matrix (6 x 4 for the BPI'12 (W) dataset) make it compatible for matrix multiplication in the GCN layer. In general, the normalised adjacency matrix has dimensions num_of_nodes x num_of_nodes and an input feature matrix has dimensions num_of_nodes x 4.
The network depicted in Figure 2 shows the architecture of the model that learns the next Activity ID and the timestamp of the next activity. The overall structure constructed for this work consists mainly of a Graph Convolutional layer followed by a Sequential block of three linear layers with Dropout. We have introduced five different variants of this general architecture for the experiments carried out in this work; each of these variants is explained in the subsequent subsections.
The adjacency matrix of the process graph depicted in Figure 1 is computed. Rather than the traditional approach of using binary entries, in this variant we introduce a new method in which the adjacency matrix stores the values corresponding to the weighted edges of the process graph, as in the weighted adjacency matrix computed for the BPI'12 (W) dataset.
Similarly, the weighted adjacency matrix for the BPI'12 (W) [no repeats] dataset can be computed.
By comparing the two weighted adjacency matrices, we can observe that a weighted adjacency matrix captures the fact that the BPI'12 (W) [no repeats] dataset has fewer instances of an event being followed by itself in a given case instance. A normalisation procedure similar to the one explained in the previous section is then applied to this adjacency matrix in the GCN layer.
This variant uses the binary adjacency matrix (B) described in the previous section. The degree matrix is computed, following which a symmetrically normalised adjacency matrix is obtained.
The Laplacian matrix of a graph is defined as (Godsil and Royle, 2013):

L = D - A    (5)

where D is the degree matrix and A is the adjacency matrix. For example, the Laplacian matrix corresponding to the weighted adjacency matrix (W) for the BPI'12 (W) dataset is computed accordingly.
In this variant, this Laplacian matrix is then used for all computations involved within the Graph Convolutional layer.
This variant differs from the previous one only in that it uses the binary adjacency matrix (B) instead of the weighted adjacency matrix (W); the Laplacian for the BPI'12 (W) dataset is then computed from the binary adjacency matrix in the same way.
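For both Laplacian variants, the transform itself reduces to Equation 5; a minimal sketch with a hypothetical 3-node adjacency matrix:

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A (Eq. 5), with D the degree matrix
    built from row-wise sums of A."""
    return np.diag(A.sum(axis=1)) - A

# Hypothetical binary adjacency matrix for a 3-node directed graph
A = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
L = laplacian(A)
```

The same function applies unchanged to a weighted adjacency matrix, in which case the diagonal holds the weighted degrees.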
In order to understand whether the GCN layer added any significant change to the performance, we used a variant with only the linear layers, omitting the GCN layer. The feature matrix was flattened and given as input to the linear layers. As in the other variants, Dropout is used before the last layer. Hence, the dimension of the input vectors to the MLP was (number_of_nodes x number_of_features).
All variants except the MLP take as input the corresponding normalised adjacency matrix of the process graph and the input feature matrix for the Graph Convolutional layer, which returns an embedding for each of the nodes after a linear activation. This is then fed into a Sequential network with three linear layers, which uses Dropout. In the case of the Multi-Layer Perceptron (MLP), the input is provided directly to the Sequential layers.
The Event Predictor network uses tanh activation for the first two linear layers, and cross-entropy loss is used during training. The Timestamp Predictor network instead uses ReLU activation for the first two layers, and its training process uses the Mean Absolute Error as the loss function. An Adam optimizer (Kingma and Ba, 2014) is used for training all variants on both tasks. Each of the datasets is divided into train and test sets with a test split of 33%, with 20% of the training data used as a validation set during training. Each row is associated with two labels: the next activity and the time (in seconds) after which the next event in that case takes place. As in (Tax et al., 2017), an additional label denoting the end of a case is added. Unlike (Tax et al., 2017) and other works, however, we have not divided the dataset chronologically: even though the same split ratio (2/3 training data and 1/3 test data) is maintained, a random split is adopted for better generalisation.
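A minimal NumPy sketch of this forward pass for the event predictor (the hidden-layer sizes, the identity placeholder for the normalised adjacency matrix and the random weights are assumptions; the original implementation details are not given here):

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, num_feats, num_classes = 6, 4, 7  # 6 activities + end-of-case label

# GCN layer with linear activation: node embeddings from A_norm and X
W_gcn = rng.normal(size=(num_feats, 16))
def gcn_layer(A_norm, X):
    return A_norm @ X @ W_gcn

# Three linear layers; tanh on the first two, as in the event predictor
# (Dropout is a training-time operation and is omitted from this
# inference-only sketch)
W1 = rng.normal(size=(num_nodes * 16, 32))
W2 = rng.normal(size=(32, 16))
W3 = rng.normal(size=(16, num_classes))
def head(H):
    h = np.tanh(H.flatten() @ W1)
    h = np.tanh(h @ W2)
    return h @ W3  # logits over the next-activity classes

A_norm = np.eye(num_nodes)                    # placeholder normalised adjacency
X = rng.normal(size=(num_nodes, num_feats))   # one num_of_nodes x 4 feature matrix
logits = head(gcn_layer(A_norm, X))
```

The MLP variant corresponds to skipping `gcn_layer` and feeding the flattened feature matrix straight into `head`.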
The quality of the next-activity prediction is measured in terms of the accuracy of predicting the correct label. For timestamp prediction, we use the Mean Absolute Error (MAE), calculated in days for comparison with the results reported in previous works.
We have evaluated the performance of each variant at different quartiles. The quartiles for each case instance have been computed in two different ways: based on the number of events and based on the case duration.
In order to compute the quartiles based on the number of events, the total number of events in each case instance is first calculated and divided into four equal parts. Each event is then assigned a quartile number (1, 2, 3 or 4) depending on which portion it belongs to. Numerically, this is equivalent to computing the ratio between the number of events from the beginning of the case up to that particular event and the total number of events in that case, and then scaling this ratio to four.
The computation based on case duration is slightly different in that each quartile may not contain an equal number of data points for a given case instance. The division into quartiles in this scenario is based on dividing the total case duration into four equal parts, so that each quartile corresponds to a specific time interval after the beginning of a given case instance. Each individual event in the case is then assigned a quartile number depending on the timestamp at which it occurred after the beginning of that case. Numerically, this is equivalent to computing the ratio between the time since the beginning of the case for that particular event and the total case duration, and then scaling this ratio to four.
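Both schemes reduce to the same computation, differing only in what the position and total measure (event index vs. event count, or elapsed time vs. total case duration); a sketch with hypothetical rounding at the quartile boundaries:

```python
import math

def quartile(position, total):
    """Assign a quartile (1-4) given how far through a case an event is.

    `position`/`total` can be the event index vs. the event count, or
    the elapsed time vs. the total case duration."""
    if total == 0:
        return 1
    return min(4, math.ceil(4 * position / total)) or 1

# A case with 8 events: two events land in each quartile
quartiles = [quartile(i, 8) for i in range(1, 9)]
```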
The main motivation behind investigating the performance of each of the model variants at different quartiles was to identify whether there were any interesting patterns in the predictions at different stages of a process. For the events in the last quartile, the event predictor also has to predict the end of the case, since the end-of-case label is one of the classes. This approach gives a generic overview of the performance of each model variant.
Table 2. Accuracy for event prediction, with quartiles based on the number of events.

| Dataset | Model | Q1 | Q2 | Q3 | Q4 | All |
| --- | --- | --- | --- | --- | --- | --- |
| Helpdesk | GCN (Weighted) | 0.733 | 0.7006 | 0.7774 | 0.9596 | 0.8086 |
| | GCN (Binary) | 0.7287 | 0.6867 | 0.7798 | 0.9391 | 0.798 |
| | GCN (Laplacian On Binary) | 0.733 | 0.702 | 0.8145 | 0.9343 | 0.811 |
| | GCN (Laplacian On Weighted) | 0.733 | 0.705 | 0.8065 | 0.9446 | 0.813 |
| | MLP | 0.733 | 0.7079 | 0.8056 | 0.963 | 0.8196 |
| BPI'12 (W) | GCN (Weighted) | 0.7354 | 0.7912 | 0.8236 | 0.4062 | 0.6671 |
| | GCN (Binary) | 0.7203 | 0.7852 | 0.8206 | 0.4251 | 0.6677 |
| | GCN (Laplacian On Binary) | 0.7364 | 0.7886 | 0.818 | 0.4073 | 0.6657 |
| | GCN (Laplacian On Weighted) | 0.612 | 0.735 | 0.8024 | 0.5415 | 0.6649 |
| | MLP | 0.7347 | 0.7716 | 0.818 | 0.4548 | 0.6758 |
| BPI'12 (W) [no repeats] | GCN (Weighted) | 0.5681 | 0.6158 | 0.4408 | 0.567 | 0.5544 |
| | GCN (Binary) | 0.976 | 0.7158 | 0.356 | 0.4741 | 0.5729 |
| | GCN (Laplacian On Binary) | 0.975 | 0.7271 | 0.4016 | 0.448 | 0.5756 |
| | GCN (Laplacian On Weighted) | 0.9654 | 0.6617 | 0.3361 | 0.5224 | 0.5707 |
| | MLP | 0.9875 | 0.8054 | 0.4515 | 0.4992 | 0.6302 |
Table 3. MAE (in days) for time prediction, with quartiles based on the number of events.

| Dataset | Model | Q1 | Q2 | Q3 | Q4 | All |
| --- | --- | --- | --- | --- | --- | --- |
| Helpdesk | GCN (Weighted) | 0.3114 | 0.3408 | 0.3739 | 0.077 | 0.2617 |
| | GCN (Binary) | 0.3217 | 0.363 | 0.3848 | 0.0692 | 0.2699 |
| | GCN (Laplacian On Binary) | 0.3656 | 0.3651 | 0.3629 | 0.0788 | 0.2721 |
| | GCN (Laplacian On Weighted) | 0.3566 | 0.371 | 0.3857 | 0.0692 | 0.2761 |
| | MLP | 0.1506 | 0.1615 | 0.214 | 0.0358 | 0.1342 |
| BPI'12 (W) | GCN (Weighted) | 0.3616 | 0.3697 | 0.3751 | 0.3402 | 0.36 |
| | GCN (Binary) | 0.3542 | 0.3612 | 0.3665 | 0.3545 | 0.3589 |
| | GCN (Laplacian On Binary) | 0.3752 | 0.3677 | 0.3663 | 0.3399 | 0.3603 |
| | GCN (Laplacian On Weighted) | 0.369 | 0.3617 | 0.361 | 0.3413 | 0.3567 |
| | MLP | 0.2933 | 0.3091 | 0.3155 | 0.3332 | 0.3148 |
| BPI'12 (W) [no repeats] | GCN (Weighted) | 0.443 | 0.4411 | 0.4961 | 0.1514 | 0.3399 |
| | GCN (Binary) | 0.3629 | 0.4027 | 0.5018 | 0.1923 | 0.3373 |
| | GCN (Laplacian On Binary) | 0.4113 | 0.419 | 0.4884 | 0.1643 | 0.3335 |
| | GCN (Laplacian On Weighted) | 0.4443 | 0.4399 | 0.4798 | 0.16 | 0.3395 |
| | MLP | 0.3528 | 0.3505 | 0.3863 | 0.1887 | 0.2952 |
Table 4. Accuracy for event prediction, with quartiles based on case duration.

| Dataset | Model | Q1 | Q2 | Q3 | Q4 | All |
| --- | --- | --- | --- | --- | --- | --- |
| Helpdesk | GCN (Weighted) | 0.7599 | 0.552 | 0.6188 | 0.9119 | 0.8086 |
| | GCN (Binary) | 0.7512 | 0.552 | 0.5856 | 0.8999 | 0.798 |
| | GCN (Laplacian On Binary) | 0.7685 | 0.6108 | 0.6188 | 0.901 | 0.811 |
| | GCN (Laplacian On Weighted) | 0.7698 | 0.5973 | 0.6298 | 0.9046 | 0.813 |
| | MLP | 0.7739 | 0.5973 | 0.6298 | 0.9156 | 0.8196 |
| BPI'12 (W) | GCN (Weighted) | 0.7616 | 0.9079 | 0.8139 | 0.4358 | 0.6671 |
| | GCN (Binary) | 0.75 | 0.909 | 0.8112 | 0.4497 | 0.6677 |
| | GCN (Laplacian On Binary) | 0.7639 | 0.9004 | 0.8053 | 0.4354 | 0.6657 |
| | GCN (Laplacian On Weighted) | 0.6718 | 0.8967 | 0.7864 | 0.5337 | 0.6649 |
| | MLP | 0.7632 | 0.9018 | 0.7968 | 0.4664 | 0.6758 |
| BPI'12 (W) [no repeats] | GCN (Weighted) | 0.5226 | 0.469 | 0.685 | 0.5766 | 0.5544 |
| | GCN (Binary) | 0.719 | 0.4069 | 0.4814 | 0.4657 | 0.5729 |
| | GCN (Laplacian On Binary) | 0.7106 | 0.4754 | 0.5982 | 0.4575 | 0.5756 |
| | GCN (Laplacian On Weighted) | 0.6991 | 0.2912 | 0.469 | 0.493 | 0.5707 |
| | MLP | 0.7462 | 0.5889 | 0.731 | 0.5138 | 0.6302 |
Table 5. MAE (in days) for time prediction, with quartiles based on case duration.

| Dataset | Model | Q1 | Q2 | Q3 | Q4 | All |
| --- | --- | --- | --- | --- | --- | --- |
| Helpdesk | GCN (Weighted) | 0.3437 | 0.3814 | 0.3842 | 0.1422 | 0.2617 |
| | GCN (Binary) | 0.3599 | 0.3939 | 0.3846 | 0.1415 | 0.2699 |
| | GCN (Laplacian On Binary) | 0.3709 | 0.3717 | 0.364 | 0.1385 | 0.2721 |
| | GCN (Laplacian On Weighted) | 0.3785 | 0.3848 | 0.3851 | 0.1358 | 0.2761 |
| | MLP | 0.177 | 0.2234 | 0.2204 | 0.0666 | 0.1342 |
| BPI'12 (W) | GCN (Weighted) | 0.3588 | 0.3622 | 0.3928 | 0.3479 | 0.3601 |
| | GCN (Binary) | 0.3498 | 0.358 | 0.3817 | 0.3596 | 0.3589 |
| | GCN (Laplacian On Binary) | 0.3705 | 0.3551 | 0.3809 | 0.3438 | 0.3603 |
| | GCN (Laplacian On Weighted) | 0.3645 | 0.3528 | 0.3763 | 0.3426 | 0.3567 |
| | MLP | 0.3018 | 0.3239 | 0.3275 | 0.32 | 0.3148 |
| BPI'12 (W) [no repeats] | GCN (Weighted) | 0.4394 | 0.5087 | 0.5131 | 0.2079 | 0.3399 |
| | GCN (Binary) | 0.3709 | 0.5815 | 0.5754 | 0.2507 | 0.3373 |
| | GCN (Laplacian On Binary) | 0.4091 | 0.528 | 0.5302 | 0.2181 | 0.3335 |
| | GCN (Laplacian On Weighted) | 0.4432 | 0.4782 | 0.4812 | 0.2108 | 0.3395 |
| | MLP | 0.3495 | 0.3808 | 0.3799 | 0.2251 | 0.2952 |
The Helpdesk dataset is a relatively small dataset on which the Event Predictor predicts the next activity as one among 10 classes (nine activities plus the end-of-case label). The model corresponding to the best validation loss is saved for each of the model variants and then evaluated on the same test set. To understand the quality of predictions at different stages of the process, we have measured the accuracy and Mean Absolute Error (MAE) for different quartiles of the process execution as well. As explained in the previous section, quartiles can be computed in two ways: based on the number of events and based on the case duration.
Each of the model variants was initially run with different learning rates for the Adam optimizer. The accuracy and MAE values corresponding to the different values of this hyperparameter, along with the training plots, are provided in the 'Supplementary materials'. For all the GCN variants, the best performance for the Timestamp Predictor was obtained with a learning rate of 0.001. For the Event Predictor, the GCN variant with the weighted adjacency matrix gave the best performance at a learning rate of 0.001, while all other GCN variants performed best at 0.0001. For the MLP model, both tasks gave the best results at a learning rate of 0.0001.
The accuracy values achieved for the event prediction task are presented in Table 2 and Table 4. A similar procedure is applied for the timestamp prediction task; the Mean Absolute Error (in days) achieved on the test set by the models saved for the different variants is shown in Table 3 and Table 5.
All the models perform very well on both tasks for this dataset. It can be observed from Table 3 and Table 5 that the MLP model outperforms all other variants in the time prediction task, in all individual quartiles as well as in the overall Mean Absolute Error. Considering the quartiles based on time duration, even though the models show similar overall quality for the Event Predictor, the GCN variant with the Laplacian transform of the binary adjacency matrix outperforms the MLP in the second quartile, and the GCN variant with the Laplacian transform of the weighted adjacency matrix matches the performance of the MLP in the third quartile. In the case of quartiles based on the number of events, a slightly different trend is observed: almost all the variants perform equally well in the first quartile, and the MLP model achieves the best performance in all individual quartiles except the third, where the GCN variant with the Laplacian transform of the binary adjacency matrix outperforms all other variants.
Across all entries irrespective of quartile, the Event predictor achieves a maximum accuracy of 81.96% and the Timestamp predictor achieves a minimum MAE of 0.1342 days.
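The quartile bucketing used throughout this analysis can be sketched as follows. Here `case_lengths` is illustrative toy data (not taken from the event logs), and Python's `statistics.quantiles` stands in for whatever quantile routine was actually used; the same idea applies to quartiles based on case duration.

```python
from statistics import quantiles

# Illustrative case sizes (number of events per case); toy data only.
case_lengths = [3, 4, 4, 5, 6, 7, 8, 10, 12, 15]

# Three cut points that split the cases into four equally populated groups.
q1, q2, q3 = quantiles(case_lengths, n=4)

def quartile_of(length):
    """Return the quartile index (1-4) a case of this length falls into."""
    if length <= q1:
        return 1
    if length <= q2:
        return 2
    if length <= q3:
        return 3
    return 4
```

Each test-set entry can then be assigned a quartile and the accuracy/MAE aggregated per bucket, as in Tables 2-5.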
A procedure identical to that used for the Helpdesk dataset was followed to evaluate the next-activity prediction and timestamp prediction tasks on the BPI’12 (W) dataset. The quartiles are computed in the same way as before. The accuracy values for event prediction on the test set using the different model variants are reported in Table 4 and Table 2, and the corresponding MAE values (in days) for the time prediction task are presented in Table 5 and Table 3.
All the model variants were run with different initial learning rates for the Adam optimizer, and the models with the best results were used for further experiments. All the values have been tabulated and plotted in the ’Supplementary materials’. The Timestamp predictor gave the best results for all variants with a learning rate of 0.0001. This is also the preferred learning rate for the Event predictor in all GCN variants, except the one using the Laplacian transform of the binary adjacency matrix, where it is 0.001. The Event predictor for the MLP model gives the best results at a learning rate of 0.00001.
As with the Helpdesk dataset, the MLP model outperforms all other variants on the time prediction task, achieving an overall minimum MAE of 0.3148 days. Slight variations can be observed in the results of the Event predictor. Focusing first on the quartiles sorted according to the number of events (Table 2), even though the MLP achieves the highest overall accuracy across events from all quartiles, it fails to outperform the GCN variants in the individual quartiles: the GCN variant with the weighted adjacency matrix achieves the best performance in the second and third quartiles. A similar trend can be observed in the Event predictor results of Table 4, where the variants achieving the best performance in the individual quartiles remain the same, except in the second quartile. A maximum accuracy of 67.58% is achieved for the event prediction task on entries from all quartiles.
This version of the BPI’12 (W) dataset differs from the previous one in that it has fewer instances of the same event being repeated within a given case. This is reflected in a comparison of the weighted adjacency matrices provided in Section 3.4.1.
The optimal learning rate for the Adam optimizer differs across variants and tasks. The GCN variant with the weighted adjacency matrix and the MLP model give the best results with a learning rate of 0.0001. The GCN variants with the binary adjacency matrix and with the Laplacian transform of the binary adjacency matrix give the best results at a learning rate of 0.0001 for the Event predictor and 0.001 for the Timestamp predictor. For the GCN variant with the Laplacian transform of the weighted adjacency matrix, the optimal learning rate is 0.00001 for the Event predictor and 0.0001 for the Timestamp predictor. All the training plots and performance values are presented in the ’Supplementary materials’.
It can be observed from Table 4 and Table 2 that there is no improvement in event prediction accuracy over the BPI’12 (W) dataset. Interestingly, however, unlike on the BPI’12 (W) dataset, the MLP achieves the best results in all individual quartiles except the last one, where the GCN variant with the weighted adjacency matrix outperforms all other variants under both quartile-estimation schemes. The MAE for the time prediction task, on the other hand, improves over the BPI’12 (W) dataset, as seen in Table 5 and Table 3. The GCN variant with the weighted adjacency matrix also outperforms the MLP model in the last quartile for the Time predictor. Hence, the GCN variant with the weighted adjacency matrix performs best in the last quartile for both tasks on the BPI’12 (W) [no repeats] dataset.
The maximum accuracy achieved by the Event predictor is 63.02% and the minimum Mean Absolute Error obtained by the Time predictor is 0.2952 days.
| Model | Accuracy (Helpdesk) | Accuracy (BPI’12 (W)) | Accuracy (BPI’12 (W) [no repeats]) | MAE in days (Helpdesk) | MAE in days (BPI’12 (W)) | MAE in days (BPI’12 (W) [no repeats]) |
| GCN (Weighted) | 0.8086 | 0.6671 | 0.5544 | 0.2617 | 0.3601 | 0.3399 |
| GCN (Binary) | 0.798 | 0.6677 | 0.5729 | 0.2699 | 0.3589 | 0.3373 |
| GCN (Laplacian on Weighted) | 0.813 | 0.6649 | 0.5707 | 0.2761 | 0.3567 | 0.3395 |
| GCN (Laplacian on Binary) | 0.811 | 0.6657 | 0.5756 | 0.2721 | 0.3603 | 0.3335 |
| MLP | 0.8196 | 0.6758 | 0.6302 | 0.1342 | 0.3148 | 0.2952 |
| Tax et al. (Tax et al., 2017) | 0.7123 | 0.76 | - | 3.75 | 1.56 | - |
| Pasquadibisceglie et al. (Pasquadibisceglie et al., 2019) | 0.7393 | 0.7817 | - | - | - | - |
| Evermann et al. (Evermann et al., 2016) | - | 0.623 | - | - | - | - |
| Breuker et al. (Breuker et al., 2016) | - | 0.719 | - | - | - | - |
| Van der Aalst et al. (Van der Aalst et al., 2011) | - | - | - | 5.67 | 1.91 | - |
As mentioned in Section 2, the tasks of event prediction and timestamp prediction have been explored in various other works using other techniques. Table 6 compiles the best results reported in those works and compares them with the results obtained from our approach.
Tax et al. (Tax et al., 2017) performed both of these tasks on the same datasets, and are hence a strong competitor. Pasquadibisceglie et al. (Pasquadibisceglie et al., 2019) performed next-activity prediction on both of these datasets. Evermann et al. (Evermann et al., 2016) and Breuker et al. (Breuker et al., 2016) performed the next-activity prediction task on the BPI’12 (W) dataset. Van der Aalst et al. (Van der Aalst et al., 2011) performed the time prediction task on both of these datasets. The BPI’12 (W) [no repeats] dataset has not been used for these specific tasks in any previous work.
The results for the event prediction task and the task of predicting the timestamp of the next event, as reported in the respective papers, are compiled in Table 6, alongside the results obtained on these datasets with the different model variants proposed in this work.
It can be observed from Table 6 that all the model variants used in this work clearly outperform previous models on the time prediction task, by a relatively large margin. For the event prediction task, the results are mixed. On the Helpdesk dataset, all the model variants outperform both the LSTM model (Tax et al., 2017) and the Convolutional Neural Network model (Pasquadibisceglie et al., 2019), but they fail to improve on the accuracy obtained on the BPI’12 (W) dataset. The BPI’12 (W) [no repeats] version was used in this work to investigate this aspect of the model; however, reducing the number of occurrences of repeated events does not improve the prediction accuracy for the next event.
Even though the feature vector representation used in this work was motivated by the requirements of a Graph Convolutional layer, the results show that the GCN layer adds little improvement over a linear neural network that takes the raw feature vectors as input. As seen in the results presented here, the MLP model achieves the best overall results of all five variants used in this work, across all tasks. One possible explanation for this behaviour is that an improved feature vector representation, compared to other works such as the LSTM model (Tax et al., 2017), may itself have improved the performance of the deep learning models. Moreover, given that the number of classes in the event prediction task is not very high for any of the datasets (9+1 classes for the Helpdesk dataset and 6+1 classes for the BPI’12 dataset), a simple MLP may be more effective at learning correlations between the input features and the target labels. In future work, it would be interesting to test these models on datasets with a larger number of classes for the event prediction task.
A potential threat to the validity of these results stems from one of the assumptions made during the preprocessing stage: when there are recurring events in a particular case, we use only the last occurrence of the event. As the time prediction results show, this assumption has not adversely affected the prediction of the timestamp of the next event. For the event prediction task, the accuracy values on the Helpdesk dataset are also significantly high. Nevertheless, it would be interesting to explore other variations of the feature vector representation.
In general, our experiments show that a natural way to represent an event log is as a graph structure with a feature vector for each node. The 9 x 4 matrix representation for the Helpdesk dataset and the 6 x 4 matrix representation for the BPI’12 (W) dataset are in accordance with this graph-structured representation. A Multilayer Perceptron (MLP) taking inputs in this graphical form performs much better than the other models. This strong performance of the MLP model is surprising, and further studies are needed to explain this behaviour.
These models suggest several directions for future work. A more efficient and intelligent way of dealing with recurring events within a given case instance could be developed. Another interesting direction would be to include unstructured data, such as text, in the feature vector of each node; this could be done by combining techniques from text analytics with the feature vector representation presented here.
A typical event log gives an organised list of the different case instances of a particular type of process. Each distinct case instance is identified by a ’Case ID’. Each row in the dataset can then be interpreted as an individual event occurring in one of these case instances, where each distinct event has an associated ’Activity ID’ label. An event log also stores the timestamp at which a particular event occurs, denoted by the ’Complete Timestamp’. Hence, the event logs used in this work have the three columns shown in Figure 3: Case ID, Activity ID and Complete Timestamp.
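Reading such a log and grouping its rows into per-case traces can be sketched as follows; the rows are a made-up miniature example in the spirit of these logs, not actual data from the Helpdesk or BPI’12 datasets.

```python
import csv
import io
from collections import defaultdict

# A tiny illustrative event log with the three columns described above.
log = """Case ID,Activity ID,Complete Timestamp
1,1,2012-04-03 16:55:38
1,2,2012-04-03 17:02:10
2,1,2012-04-04 09:12:00
"""

# Group the rows by Case ID: each case becomes an ordered list of
# (Activity ID, Complete Timestamp) events, i.e. a trace.
traces = defaultdict(list)
for row in csv.DictReader(io.StringIO(log)):
    traces[row["Case ID"]].append((row["Activity ID"], row["Complete Timestamp"]))
```

Each resulting trace corresponds to one case instance; prediction targets (next activity, next timestamp) are then derived from consecutive events within a trace.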
A process can represent any sequence of events taking place in a business firm. As a simple example, consider the imaginary process depicted in Figure 4.
In Figure 4, the number in brackets after each event can be interpreted as its Activity ID. Different case instances of this kind of process can therefore take various forms, some of which are given in Table 7.
| Variant 1 | Variant 2 | Variant 3 |
| (1) Register Complaint | (1) Register Complaint | (1) Register Complaint |
| (2) Examine Complaint | (2) Examine Complaint | (2) Examine Complaint |
| (3) Provide debugging instructions | (5) Assign Technician | (5) Assign Technician |
| (4) Complaint Resolved | (6) Check if resolved | (6) Check if resolved |
|  | (4) Complaint Resolved | (5) Assign Technician |
|  |  | (6) Check if resolved |
|  |  | (4) Complaint Resolved |
Each of the variants depicted in Table 7 can be interpreted as a distinct Case ID. The datasets used in this work likewise correspond to similar processes in their respective domains. Unlike in the dummy example, however, we cannot directly produce a process graph of the kind shown in Figure 4; hence, we use the concept of Directly-follows graphs (Equation 2) to obtain the process graphs (Figure 1).
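The core of a directly-follows graph is a count of how often one activity immediately follows another across all traces. A minimal sketch of building the weighted adjacency matrix, using the dummy variants of Table 7 with activities labelled 1-6, is given below (the exact construction in the paper may differ in details such as normalisation).

```python
# Traces reproducing the three dummy variants from Table 7.
traces = [
    [1, 2, 3, 4],           # Variant 1
    [1, 2, 5, 6, 4],        # Variant 2
    [1, 2, 5, 6, 5, 6, 4],  # Variant 3
]

n = 6  # number of distinct activities
A = [[0] * n for _ in range(n)]
for trace in traces:
    for a, b in zip(trace, trace[1:]):
        # A[i][j] counts how often activity j+1 directly follows activity i+1.
        A[a - 1][b - 1] += 1
```

For instance, activity 2 directly follows activity 1 in all three variants, so the corresponding entry is 3; replacing the counts with 0/1 indicators yields the binary adjacency matrix used by the other model variants.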
The Graph Convolutional layer used in this work is a basic GCN model that performs the classical convolution operation on a graph (Equation 3). There have been many recent advancements towards more sophisticated GCN architectures; modern GCNs fall into two major categories, spectral-based (Shuman et al., 2013) and spatial-based (Micheli, 2009) approaches.
In general, a graph convolution operation is similar to a traditional convolution operation on images; the major difference is that GCNs deal with irregular data structures represented as graphs, whose information can be systematically represented by an adjacency matrix. Many real-world datasets can be stored as graphs. For instance, the Cora dataset (http://networkrepository.com/cora.php) (Rossi and Ahmed, 2015) is a citation network of scientific publications sorted into seven different classes. Social networks and communication networks (https://snap.stanford.edu/data/) are other examples of data stored as graph structures.
In this work, the data flow through the Graph Convolutional layer is as shown in Figure 5. The output of the GCN layer is fed into a Sequential layer consisting of three linear layers with dropout. The adjacency matrix is derived from the process graphs, and the feature matrix is computed for each row of a given dataset.
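The basic graph-convolution step of Equation 3 can be sketched as below. We assume the common normalisation H' = ReLU(D^{-1/2} (A + I) D^{-1/2} X W), which may differ in detail from the paper's exact formulation; the graph, features and weights are toy values.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution step: ReLU(D^{-1/2} (A+I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# Toy example: 3-node path graph, 4 input features per node, 2 output channels.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.ones((3, 4))
W = 0.5 * np.ones((4, 2))
H = gcn_layer(A, X, W)  # shape (3, 2)
```

Each output row mixes a node's own features with those of its neighbours, weighted by the normalised adjacency matrix; this output is what the Sequential block of linear layers then consumes.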
The training process for the variant involving a Graph Convolutional layer with the weighted adjacency matrix is depicted in Figure 6 for different learning rates. These learning rates are those of the Adam optimizer used to train all the model variants in this work. As seen in Table 8, learning rates of 0.001 and 0.0001 produced the best results for the Helpdesk dataset and the BPI’12 datasets, respectively.
| Learning Rate | Helpdesk Accuracy | Helpdesk MAE (days) | BPI’12 (W) Accuracy | BPI’12 (W) MAE (days) | BPI’12 (W) [no repeats] Accuracy | BPI’12 (W) [no repeats] MAE (days) |
| 0.001 | 0.8086 | 0.2617 | 0.6138 | 0.3699 | 0.5185 | 0.3402 |
| 0.0001 | 0.7973 | 0.3031 | 0.6671 | 0.36 | 0.5544 | 0.3399 |
| 0.00001 | 0.7637 | 0.3323 | 0.6597 | 0.3726 | 0.5118 | 0.3466 |
The training process for the variant using the binary adjacency matrix is shown in Figure 7. The values given in bold in Table 9 correspond to the best model for each of the tasks on the three datasets.
| Learning Rate | Helpdesk Accuracy | Helpdesk MAE (days) | BPI’12 (W) Accuracy | BPI’12 (W) MAE (days) | BPI’12 (W) [no repeats] Accuracy | BPI’12 (W) [no repeats] MAE (days) |
| 0.001 | 0.7743 | 0.2699 | 0.6631 | 0.3614 | 0.4532 | 0.3373 |
| 0.0001 | 0.798 | 0.3147 | 0.6677 | 0.3589 | 0.5729 | 0.3635 |
| 0.00001 | 0.7743 | 0.323 | 0.6503 | 0.3826 | 0.5479 | 0.3457 |
This variant uses the Laplacian transform of the binary adjacency matrix obtained from each of the datasets. It can be observed from Table 10 that the best models for each task correspond to learning rates of 0.001 or 0.0001. The training plots for each of the learning rates in Table 10 are shown in Figure 8.
| Learning Rate | Helpdesk Accuracy | Helpdesk MAE (days) | BPI’12 (W) Accuracy | BPI’12 (W) MAE (days) | BPI’12 (W) [no repeats] Accuracy | BPI’12 (W) [no repeats] MAE (days) |
| 0.001 | 0.7922 | 0.2721 | 0.6657 | 0.3629 | 0.4655 | 0.3335 |
| 0.0001 | 0.811 | 0.2863 | 0.6527 | 0.3603 | 0.5756 | 0.3388 |
| 0.00001 | 0.7748 | 0.2993 | 0.6622 | 0.3737 | 0.5232 | 0.3662 |
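For the Laplacian-based variants, a minimal sketch of the transform applied to the adjacency matrix before it enters the GCN layer is given below. We assume the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}; the paper's exact normalisation may differ.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    with np.errstate(divide="ignore"):
        # Isolated nodes (degree 0) get a zero scaling factor.
        d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

# Toy 2-node graph with a single edge: L = [[1, -1], [-1, 1]].
A = np.array([[0., 1.],
              [1., 0.]])
L = normalized_laplacian(A)
```

The same function applies to both the binary and the weighted adjacency matrices, yielding the two Laplacian-based model variants.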
This variant corresponds to the model whose Graph Convolutional layer computations use the Laplacian transform of the weighted adjacency matrix. Figure 9 shows the training plots, and Table 11 compiles the accuracy and mean absolute error of the Event predictor and the Timestamp predictor at different learning rates.
| Learning Rate | Helpdesk Accuracy | Helpdesk MAE (days) | BPI’12 (W) Accuracy | BPI’12 (W) MAE (days) | BPI’12 (W) [no repeats] Accuracy | BPI’12 (W) [no repeats] MAE (days) |
| 0.001 | 0.7807 | 0.2761 | 0.6597 | 0.3644 | 0.512 | 0.3418 |
| 0.0001 | 0.813 | 0.2929 | 0.6649 | 0.3567 | 0.5343 | 0.3395 |
| 0.00001 | 0.767 | 0.3275 | 0.6638 | 0.3672 | 0.5707 | 0.3461 |
As observed in the experimental results, the Multilayer Perceptron (MLP) model gave the best performance for most of the tasks. The training plots for different learning rates of the MLP model are shown in Figure 10, and Table 12 tabulates all the accuracy and mean absolute error values.
| Learning Rate | Helpdesk Accuracy | Helpdesk MAE (days) | BPI’12 (W) Accuracy | BPI’12 (W) MAE (days) | BPI’12 (W) [no repeats] Accuracy | BPI’12 (W) [no repeats] MAE (days) |
| 0.001 | 0.8137 | 0.1402 | 0.6477 | 0.3247 | 0.5959 | 0.3006 |
| 0.0001 | 0.8196 | 0.1342 | 0.6757 | 0.3148 | 0.6302 | 0.2952 |
| 0.00001 | 0.8223 | 0.1374 | 0.6758 | 0.3196 | 0.6269 | 0.3016 |
All the model variants were tested with different initial learning rates for the Adam optimizer. For each variant, the training plots of the best-performing models are shown in Figure 11. All values in the plots denote scores obtained on the validation set during training. These plots make clear that the MLP model has the training curves with the lowest loss in most cases, as reflected in the results presented in Section 4.
The source code for all the experiments carried out in this work is available at https://github.com/ishwarvenugopal/gcnprocessmining.
Process mining for Python (PM4Py): bridging the gap between process and data science. CoRR abs/1905.06169.
Twenty-fourth International Joint Conference on Artificial Intelligence.
The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30(3), pp. 83–98.
Data & Knowledge Engineering 47(2), pp. 237–267.