1. Introduction
Process Mining (PM) is a recent research field that advances Business Process Management (BPM) using Machine Learning (ML) and data mining solutions. Two first-class citizens in this field are event logs extracted from (business) information systems, and explainable process models. An event log contains a sequential end-to-end record of what happened in the cases, which are instances of the process. A process model is a graphical representation of the process behavior and defines which sequences of process steps (activities) are possible. A process discovery algorithm extracts such a process model from the sequential process behavior observed as event sequences (traces) in reality and recorded in an event log.
Learning to predict how active instances or cases of a business process are likely to unfold in the future is the main goal of predictive process mining or predictive process monitoring (Neu et al., 2021). Fueled by the advances made in Deep Learning (DL), the prospect of accurate predictors of the (process) future has received a lot of attention and promises to let organizations act on process problems before they manifest. This complements the more retrospective nature of process discovery, which provides aggregated and interpretable models of historical cases by leveraging events logged by information systems.
Several prediction tasks have been investigated (Neu et al., 2021): looking at the immediate future by predicting properties of the next event (next-event prediction), looking at what happens at the end of the process execution by predicting properties of the case (outcome prediction, e.g., loan application acceptance, or customer churn prediction), or attempting to construct the whole process execution by predicting properties of all future events (suffix prediction, i.e., sequence generation). Event properties of interest include the event label or activity label and the timestamp. Case or outcome properties are often concerned with some performance indicator of the process. We focus on suffix prediction as the most challenging task for predicting event properties.
DL has been proposed for all three categories of prediction tasks, including suffix prediction, and has consistently outperformed the alternatives based on other methods (Neu et al., 2021). Many of the DL architectures have been adapted for event logs from Natural Language Processing (NLP) and Computer Vision, and have led to substantial improvements. However, traces are multi-modal (e.g., they contain both a label and a time), do not form a continuous corpus as in NLP, and the trace-length distribution of an event log can be highly skewed. DL trains efficiently thanks to parallel processing on graphics cards. To enable parallel processing, equal-length samples are batched, which usually requires padding. Trace-length skewness therefore makes it necessary to introduce a large number of padding tokens.
Two studies have explored how some of the peculiarities of event logs are related to the prediction performance for next-event prediction (Heinrich et al., 2021) and for outcome prediction (Kratsch and others, 2021). The performance of suffix prediction was evaluated only on average performance measures for a few real-life event logs (Neu et al., 2021). In addition, different preprocessing and evaluation strategies have been used, which makes it difficult to compare the results of the different approaches (Neu et al., 2021; Weytjens and De Weerdt, 2021). Our work contributes a unified framework for case suffix prediction with seven sequential DL models, multi-modal event attribute fusion, and multi-target prediction (Section 3). The end-to-end framework automatically processes event logs and avoids feature engineering. We show in detail how the performance of the different models varies over prefixes of different lengths and highlight the corresponding challenges (Section 4). We show that skewness affects all DL models and that it is insufficient to rely solely on average performance measures.
2. Problem Formulation and Notation
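In suffix and remaining time prediction, the dataset, i.e., the event log, is a set of sequences (process executions or traces) $\mathcal{D} = \{\sigma^{(1)}, \dots, \sigma^{(N)}\}$, where $N$ is the size of the dataset. The $i$-th process execution contains a sequence of events, i.e., $\sigma^{(i)} = \langle e_1, e_2, \dots, e_{n_i} \rangle$. An event $e = (a, t)$ has two attributes, where the former is the event's label, i.e., an activity, and the latter is the event's duration time, or the required execution time. For a given process execution $\sigma = \langle e_1, \dots, e_n \rangle$, the prefix of events of length $k$ is defined by $\sigma_{\le k} = \langle e_1, \dots, e_k \rangle$ and its corresponding suffix of events is $\sigma_{>k} = \langle e_{k+1}, \dots, e_n \rangle$.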
Definition 2.1 (Suffix prediction and remaining time prediction).
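Suppose that there are $m$ sample pairs of sequences $\{(\sigma^{(j)}_{\le k}, \sigma^{(j)}_{>k})\}_{j=1}^{m}$, where $k$ is the prefix length and $m$ is the sample size (which is also the number of prefixes). Given a prefix of events $\sigma_{\le k}$, the output prediction is the sequence of events $\hat{\sigma}_{>k} = \langle \hat{e}_{k+1}, \dots, \hat{e}_{\hat{n}}, \text{[EOS]} \rangle$, where [EOS] is a special symbol added to the end of each process execution during preprocessing to mark the end of the sequence. Suffix prediction is the sequence of activities in $\hat{\sigma}_{>k}$, i.e., $\langle \hat{a}_{k+1}, \dots, \hat{a}_{\hat{n}} \rangle$. The remaining time prediction is the sum of the predicted duration times in $\hat{\sigma}_{>k}$, i.e., $\sum_{i=k+1}^{\hat{n}} \hat{t}_i$, where $\hat{e}_i = (\hat{a}_i, \hat{t}_i)$.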
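To make the definition concrete, the following minimal sketch enumerates the prefix-suffix pairs of one trace and computes the remaining time of a suffix as the sum of its durations. The names `prefix_suffix_pairs` and `remaining_time`, the tuple layout of an event, the activity names, and the choice to start at prefix length 1 are illustrative assumptions, not part of the framework's code.

```python
from typing import List, Tuple

Event = Tuple[str, float]  # (activity label, duration in seconds); assumed layout
EOS = "[EOS]"


def prefix_suffix_pairs(trace: List[Event]) -> List[Tuple[List[Event], List[Event]]]:
    """Return every (prefix, suffix) split of a trace for k = 1 .. len(trace) - 1.

    The [EOS] marker, with zero duration, is appended to each suffix.
    """
    pairs = []
    for k in range(1, len(trace)):
        prefix = trace[:k]
        suffix = trace[k:] + [(EOS, 0.0)]
        pairs.append((prefix, suffix))
    return pairs


def remaining_time(suffix: List[Event]) -> float:
    """Remaining time is the sum of the event durations in the suffix."""
    return sum(duration for _, duration in suffix)


# Hypothetical trace with made-up activities and durations.
trace = [("register", 0.0), ("check", 120.0), ("decide", 300.0), ("notify", 60.0)]
pairs = prefix_suffix_pairs(trace)
first_prefix, first_suffix = pairs[0]
```

For the trace above, the prefix of length 1 has the suffix check, decide, notify, [EOS], and its remaining time is the sum of those three durations.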
3. Proposed Framework
The proposed framework in Fig. 1 provides deep (i.e., several layers stacked), autoregressive modeling for multi-objective prediction tasks of multi-modal sequences (e.g., log traces). Our vision is to perform no feature engineering and to use the original event logs without excluding traces or trace parts; e.g., we neither trim very long traces nor exclude very short ones.
We describe the three main components of our framework: embedding of an input sequence for the fusion of modalities (Section 3.1), the pluggable sequential model component (Section 3.2), and the generator providing multi-task predictions (Section 3.3).
3.1. Embedding
In the proposed framework each event is represented by a tuple $e = (\mathbf{a}, t)$, where $\mathbf{a}$ is the one-hot encoding of the activity $a$. We denote the $i$-th entry of a vector by the corresponding subscript, e.g., $\mathbf{a}_i$. The embedding component maps the input features into the latent space of the model.
First, there are two tasks for preprocessing the input event traces. 1) The heterogeneous event attributes have to be represented in a form that is adequate as input features for machine learning computations. The categorical attributes (i.e., activity label) are one-hot encoded into a vector $\mathbf{a}$, and the continuous attributes (i.e., timestamp) are min-max scaled to the range $[0, 1]$ and represented with a scalar $t$. Our framework uses two attributes: the activity label and the timestamp. We take the time attribute at a granularity of seconds and transform it to relative time expressing the duration of events (the difference of two consecutive event timestamps). This also prevents a particular data leakage from the test set, as indicated by (Weytjens and De Weerdt, 2021). The time attribute value corresponding to any special symbol for the activity label (e.g., [EOS], [PAD], [SOS], and [MASK]) is defined as zero. 2) To utilise parallel processing resources, the input sequences are batched. This introduces a constraint: sequences in a batch have to be of equal length. Several approaches are possible to ensure that, such as splitting the original sequences into smaller moving windows of subsequences, or left/middle/right padding of the original sequences (Rio and others, 2020). All approaches may impact the prediction performance of the model: windowing limits the receptive field of the model, and padding changes the activity label distribution of the dataset due to the excessive amount of padding needed.
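The preprocessing steps described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names, the use of right padding, and the assumption that the scaling bounds `lo` and `hi` are computed on the training split are ours and are not taken from the framework's implementation.

```python
from typing import Dict, List, Tuple

PAD = "[PAD]"


def durations_from_timestamps(timestamps: List[float]) -> List[float]:
    """Relative time: difference of consecutive timestamps (first event gets 0)."""
    return [0.0] + [t1 - t0 for t0, t1 in zip(timestamps, timestamps[1:])]


def min_max_scale(values: List[float], lo: float, hi: float) -> List[float]:
    """Scale values to [0, 1]; lo and hi are assumed to come from the training split."""
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in values]


def one_hot(label: str, vocab: Dict[str, int]) -> List[int]:
    """One-hot encode an activity label given a label-to-index vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab[label]] = 1
    return vec


def pad_right(labels: List[str], times: List[float], length: int) -> Tuple[List[str], List[float]]:
    """Right-pad a trace to a common length; padded positions get time zero."""
    n_pad = length - len(labels)
    return labels + [PAD] * n_pad, times + [0.0] * n_pad


# Made-up example values.
vocab = {"a": 0, "b": 1, PAD: 2}
durations = durations_from_timestamps([0.0, 60.0, 180.0])
scaled = min_max_scale(durations, lo=0.0, hi=120.0)
padded_labels, padded_times = pad_right(["a", "b"], [0.0, 0.5], length=4)
```

Note how padding a two-event trace to length four already makes [PAD] the most frequent label in that sample, which is the distribution shift mentioned above.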
Second, the event attributes have to be fused. Since attributes are perfectly aligned within a trace, we can deal with aligned fusion of the different modalities. (Heinrich et al., 2021; Schönig et al., 2018; Tax et al., 2017; Taymouri et al., 2021) use feature concatenation; (Zadeh et al., 2017) apply tensor fusion in the high-dimensional Cartesian space of the feature vectors; (Tsai et al., 2019) perform cross-modal attention on the concatenated sequence of all modalities; and (Lin et al., 2019; Moon et al., 2021) utilise learnable transformations for mapping and weighting the influence of the attributes.
We also apply learnable linear transformations for each feature. The scalar variables (e.g., the time attribute) are defined as a rank-zero tensor and are then broadcast into a rank-one tensor by a tensor dimension expansion operation before the transformations. The resulting vectors are summed to form the distributed embedding vector (see Fig. 1).
3.2. Sequential Deep Learning Models
We introduce ML with special emphasis on generative and AutoRegressive (AR) modeling, then we detail several DL architectures that we tested in the framework.
ML seeks to develop methods to automate certain prediction tasks on the basis of observed data. Almost all ML tasks can be formulated as making inferences about missing or latent data from the observed data. To make inferences about unobserved data from the observed data, the learning system needs to make some assumptions; taken together these assumptions constitute a model. Learning from data occurs through the transformation of the prior probability distributions (defined before observing the data) into posterior distributions (after observing the data) (Ghahramani, 2015).
We define an input space $\mathcal{X}$, which is a subset of the $d$-dimensional real space $\mathbb{R}^d$. We also define a random variable $X$ with probability distribution $p(X)$, which takes values drawn from $\mathcal{X}$. We call the realisations of $X$ feature vectors and denote them by $\mathbf{x}$. A generative model describes the marginal distribution over $X$: $p(X; \theta)$, where samples of $X$ are observed at learning time in a dataset $\mathcal{D}$ and the probability distribution depends on some unknown parameter $\theta$. A generative model family that is important for sequential analysis is the AR one. Here, we fix an ordering of the variables $X_1, \dots, X_d$, and the distribution for the $i$-th random variable depends on the values of all the preceding random variables in the chosen ordering (Bengio et al., 2015). By the chain rule of probability, we can factorize the joint distribution over the $d$ dimensions as: $p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i \mid x_1, \dots, x_{i-1})$. AR modeling performs training on the dataset $\mathcal{D}$ to estimate the unknown parameter $\theta$ by maximizing the likelihood under the forward autoregressive factorization: $\max_{\theta} \sum_{\mathbf{x} \in \mathcal{D}} \sum_{i=1}^{d} \log p(x_i \mid x_1, \dots, x_{i-1}; \theta)$. In our scope the event $e_i$ from a trace (see Def. 2.1) is the analogue of $x_i$.
We describe 7 DL architectures (4 of which are applied to suffix generation in PM for the first time) that all model autoregression but are built from different building blocks, which alters training and inference (computation and memory) complexity, model (parameter) size, perspective view (along the sequence), path length (between two positions in the sequence), and parallelisability of operations during training.
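As a small illustration of the autoregressive factorization used throughout this section, the sketch below sums the log-probabilities of each symbol given its predecessors. The conditional model `toy_cond_prob` is a made-up lookup table that only looks at the last symbol (a special case of conditioning on the full prefix); a DL model would take its place in practice.

```python
import math
from typing import Callable, Sequence


def sequence_log_likelihood(
    seq: Sequence[str],
    cond_prob: Callable[[Sequence[str], str], float],
) -> float:
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1}) by the chain rule."""
    total = 0.0
    for i, symbol in enumerate(seq):
        total += math.log(cond_prob(seq[:i], symbol))
    return total


# Hypothetical transition table: p(next symbol | last symbol); None marks the start.
TRANSITIONS = {
    None: {"a": 1.0},
    "a": {"b": 0.5, "c": 0.5},
    "b": {"c": 1.0},
}


def toy_cond_prob(prefix: Sequence[str], symbol: str) -> float:
    """Toy conditional that ignores everything but the last symbol of the prefix."""
    last = prefix[-1] if prefix else None
    return TRANSITIONS[last][symbol]


log_lik = sequence_log_likelihood(["a", "b", "c"], toy_cond_prob)
```

Maximum likelihood training adjusts the parameters behind `cond_prob` so that this sum, accumulated over all training sequences, becomes as large as possible.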
LSTM: (Evermann et al., 2017; Tax et al., 2017) are pioneering DL solutions that apply a Recurrent Neural Network (RNN) (Jordan, 1986) to predictive PM. An RNN is an AR model with a feedback loop that allows processing the previous output together with the current input, thus making the network stateful, i.e., influenced by the earlier inputs in each step. The simplified equation of a simple recurrent layer is $h_t = \sigma(W_x x_t + W_h h_{t-1})$, where the different $W$s are the learnable parameters. The hidden state $h_t$ depends on the current input $x_t$ and the previous hidden state $h_{t-1}$. There is a nonlinear dependency (e.g., the sigmoid function $\sigma$) between them. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a more recent variant that has an additional horizontal residual channel and learnable gating mechanisms to mitigate vanishing gradients during Back Propagation Through Time (BPTT). The first LSTM application to process mining is (Tax et al., 2017), with a single-event target during training. They augment the dataset with all prefix combinations of traces (see Def. 2.1) during preprocessing. During training, the target is the prediction of the sole next event (see Fig. 2). During inference, there is an open-loop suffix generation of events conditioned on a given prefix, which continues until the [EOS] special symbol is generated or a predefined maximum length is reached. Equal-length prefixes can be batched together for better utilisation of parallel processing resources.
AE:
An AutoEncoder (AE) is a bottleneck architecture that turns a high-dimensional input into a latent low-dimensional code (the encoder), and then reconstructs the input from this latent code (the decoder) (Hinton and Salakhutdinov, 2006). The two components are trained jointly via the reconstruction objective. A sequential autoencoder (i.e., encoder and decoder are both sequential models such as an RNN) is applied in PM for suffix prediction by (Lin et al., 2019; Taymouri et al., 2021). The encoder learns the representation of the prefix, the decoder that of the suffix. The encoder maps the prefix into its latent space, then the decoder predicts the suffix conditioned on that (i.e., the encoder passes a context vector representing the prefix). The dataset is augmented with all prefix-suffix combinations of traces. The target during training is to reconstruct the one-step look-ahead version of the suffix input (see Fig. 2). The suffix (input and target) can be padded. During inference, after encoding the prefix, there is an open-loop suffix generation of events that starts with the [SOS] special symbol and continues until [EOS] is generated or a predefined maximum length is reached.
AE-GAN: (Taymouri et al., 2021) extended the sequential AE with a sequence discriminator component, which provides feedback about the likeliness of the generated suffixes; the formulation includes a temperature parameter that is annealed during training. The architecture setup and the training mechanism follow the Generative Adversarial Network (GAN) (Goodfellow et al., 2014), although its objective is not exclusively the minimax game. The discriminator adds an auxiliary adversarial objective to the existing reconstruction one, and it is trained jointly with the autoencoder. In the zero-sum game setup, the objective of the autoencoder is to generate suffixes that follow the distribution of the dataset, and the adversarial objective of the discriminator is to discriminate the synthetic ones (i.e., assign $0$) from the real ones (i.e., assign $1$). The auxiliary adversarial loss for the AE is computed from the discriminator's output on the generated suffixes. To alleviate the exposure bias caused by teacher forcing (i.e., during training the target is the sole next event, but during inference a whole suffix is generated), open-loop decoding is applied for a fraction of the training steps.
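The open-loop suffix generation described for the recurrent and autoencoder models can be sketched as a simple loop: each predicted event is fed back into the context until [EOS] appears or a maximum length is reached. The `predict_next` callable stands in for a trained model; here it is a made-up lookup table, and all names are illustrative assumptions.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, float]  # (activity label, duration); assumed layout
EOS = "[EOS]"


def generate_suffix(
    prefix: List[Event],
    predict_next: Callable[[List[Event]], Event],
    max_len: int,
) -> List[Event]:
    """Generate events one by one, conditioning each step on all previous ones."""
    context = list(prefix)
    suffix: List[Event] = []
    while len(suffix) < max_len:
        event = predict_next(context)
        suffix.append(event)
        if event[0] == EOS:
            break
        context.append(event)  # the model's own output becomes part of the input
    return suffix


# Hypothetical stand-in for a trained model: next event depends on the last label only.
NEXT_EVENT = {"a": ("b", 10.0), "b": ("c", 5.0), "c": (EOS, 0.0)}


def toy_predict_next(context: List[Event]) -> Event:
    return NEXT_EVENT[context[-1][0]]


suffix = generate_suffix([("a", 0.0)], toy_predict_next, max_len=10)
truncated = generate_suffix([("a", 0.0)], toy_predict_next, max_len=2)
```

Because the loop consumes its own predictions, an early mistake propagates to all later steps, which is the exposure bias that mixing open-loop decoding into training is meant to reduce.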
Transformer: To parallelise the sequence processing (i.e., avoid the recurrence) and to improve on the encoder-decoder relation, i.e., to access more than a sole context vector when drawing global dependencies between input and output, (Vaswani et al., 2017) introduce the Transformer. This is a sequential AE architecture with a self-attention mechanism in both the encoder and the decoder, and cross-attention between the two. Self-attention is an attention mechanism relating different positions of a sequence in order to compute a representation of the sequence. The attention function can be described as mapping a query and a set of key-value pairs to an output, where the query $Q$, keys $K$, values $V$, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility/similarity function of the query with the corresponding key. The sequence of input events is first packed into $H^0 = [e_1, \dots, e_n]$, and then encoded into contextual representations at different levels of abstraction $H^l = [h^l_1, \dots, h^l_n]$ using an $L$-layer Transformer, $H^l = \mathrm{Transformer}_l(H^{l-1})$, $l \in [1, L]$. In each Transformer block, multiple self-attention heads are used to aggregate the output vectors of the previous layer. For the $l$-th Transformer layer, the output of a self-attention head $A_l$ is computed via:
$$Q = H^{l-1} W_l^Q, \quad K = H^{l-1} W_l^K, \quad V = H^{l-1} W_l^V \tag{1}$$
$$M_{ij} = \begin{cases} 0, & \text{if position } i \text{ is allowed to attend to position } j \\ -\infty, & \text{otherwise} \end{cases} \tag{2}$$
$$A_l = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V \tag{3}$$
where the previous layer's output $H^{l-1} \in \mathbb{R}^{n \times d_h}$ is linearly projected to a triple of queries, keys and values using parameter matrices $W_l^Q, W_l^K, W_l^V \in \mathbb{R}^{d_h \times d_k}$, respectively, and the mask matrix $M \in \mathbb{R}^{n \times n}$ determines whether a pair of positions (in the sequence) can attend to each other, which controls the context a token can use when computing its contextualised representation. The encoder has bidirectional modeling, meaning that the elements of $M$ are all $0$s, indicating that all the positions have access to each other. The decoder has a left-to-right modeling objective. The representation of each token encodes only the leftward context tokens and itself. This is done by using a triangular matrix for $M$, where the upper triangular part is set to $-\infty$ and the other elements to $0$. In the encoder-decoder/cross-attention layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence, which mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models. Self-attention by itself is insensitive to the order of the sequence positions, so the architecture is extended with a positional encoding mechanism, which can be absolute or relative. In our framework we apply the absolute positional encoding based on the sinusoidal function. The Transformer has the advantage of processing all sequence positions simultaneously, even though its complexity per layer is $O(n^2 \cdot d)$, where $n$ is the sequence length and $d$ is the representation dimension. Furthermore, it has a receptive field of $n$ and a maximum path length of $O(1)$. For suffix generation we apply the same prefix-suffix, input-target, and training-inference settings as for the sequential autoencoder. We are the first to apply the Transformer architecture to suffix generation in a multi-modal input and multi-task prediction setting; (Bukhsh et al., 2021) apply a variant of it tailored to next-event prediction.
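The masked attention computation, softmax(QK^T / sqrt(d_k) + M) V, can be illustrated without any DL library. The sketch below is a minimal single-head version in plain Python that omits the learned projections and uses made-up input vectors; it contrasts the all-zero (bidirectional) mask of the encoder with the triangular (left-to-right) mask of the decoder.

```python
import math
from typing import List

Matrix = List[List[float]]
NEG_INF = float("-inf")


def causal_mask(n: int) -> Matrix:
    """0 on and below the diagonal, -inf above it (left-to-right modeling)."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix, M: Matrix) -> Matrix:
    """Scaled dot-product attention with an additive mask."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + M[i][j]
            for j, k in enumerate(K)
        ]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


# Made-up inputs; the learned projections W^Q, W^K, W^V are left out for brevity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_causal = attention(X, X, X, causal_mask(3))
out_full = attention(X, X, X, [[0.0] * 3 for _ in range(3)])
```

With the causal mask the first position can only attend to itself, so its output equals its own value vector; with the all-zero mask it mixes in the later positions as well.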
GPT: A Transformer variant named Generative Pre-Training (GPT) has been successfully used for sentence generation in NLP (Radford and Narasimhan, 2018) and for next-event prediction in PM (Moon et al., 2021). It is the decoder block of the Transformer, which has a left-to-right modeling capacity. (Moon et al., 2021) do not model the multi-modal event input and do not aim at multi-target predictions either. In our framework we study GPT's suffix generation performance in the multi-modal input and multi-target prediction setting. Fig. 2 visualises the input-target setting during training. There is no need for data augmentation with prefix-suffix splits, which makes the training very efficient because any of the sequences/traces can be batched together. During inference, there is again an open-loop suffix generation of events, conditioned on a given prefix, which continues until the [EOS] special symbol is generated or a predefined maximum length is reached.
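The input-target setting of a left-to-right model can be pictured as the same trace shifted by one position: at every step the target is the next event. The sketch below shows this for activity labels only; prepending [SOS] to the input and the helper name `shifted_pair` are assumptions made for illustration.

```python
from typing import List, Tuple

SOS, EOS = "[SOS]", "[EOS]"


def shifted_pair(trace: List[str]) -> Tuple[List[str], List[str]]:
    """Build (input, target) sequences where target[i] is the event after input[i]."""
    inputs = [SOS] + list(trace)
    targets = list(trace) + [EOS]
    return inputs, targets


inputs, targets = shifted_pair(["a", "b", "c"])
```

A single pass over the whole trace thus yields one training signal per position, which is why no explicit prefix-suffix augmentation is needed in this setting.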
BERT: A language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers, was introduced for bidirectional representation by (Devlin et al., 2019). BERT is designed to pre-train deep bidirectional representations from text by jointly conditioning on both left and right context in all layers. To alleviate the unidirectionality constraint it uses a Masked Language Model (MLM) training objective. MLM randomly masks some of the tokens from the input, and the objective is to predict the original masked word based only on its context. Unlike left-to-right language model training, the MLM objective enables the representation to fuse the left and the right context. BERT is based on denoising autoencoding. Specifically, for a sequence $X = \{x_1, \dots, x_N\}$, BERT first constructs a corrupted version $\hat{X}$ by randomly setting a portion (e.g., 15%) of tokens in $X$ to a special symbol [MASK]. The corruption procedure replaces $x_n$ with [MASK] wherever $m_n = 1$, where $M = \{m_1, \dots, m_N\}$ is a binary mask of the same size as $X$, which uses $m_n = 1$ to indicate that a token will be masked. This is followed by training a Transformer encoder model to reconstruct only the masked tokens in the original text, denoted as $\bar{X}$, based on the corrupted text $\hat{X}$. The training objective is to reconstruct $\bar{X}$ from $\hat{X}$: $\max_{\theta} \log p(\bar{X} \mid \hat{X}; \theta) \approx \sum_{n=1}^{N} m_n \log p(x_n \mid \hat{X}; \theta)$, where $m_n = 1$ indicates that $x_n$ is masked. As emphasized by the $\approx$ sign, BERT factorizes the joint conditional probability based on an independence assumption that all masked tokens are separately reconstructed. Fig. 2 visualises the input-target setup during training. There is no need for data augmentation with prefix-suffix splits, which makes the training very efficient because any of the sequences/traces can be batched together. To enhance BERT for sequence generation, we applied a probabilistic mask portion (PMLM) (Liao et al., 2020) instead of a fixed percentage. In each training step, the masking ratio is sampled from the uniform distribution $\mathcal{U}(0, 1)$. Hence, during the course of training, all masked permutations of the sequence are seen. During inference, the initial condition is that all the suffix positions are masked; then, in a loop, all suffix positions are visited one by one in a random order. The token at the predicted position is then added to the context for the next step, until all suffix positions are visited. During evaluation, the result is the subsequence from the left up to the first [EOS].
WaveNet: (van den Oord and others, 2016) is a generative model that was introduced for signal time series. It is a fully (1D) Convolutional Neural Network (CNN), where the stacked convolutional layers are causal and dilated. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. Fig. 2 visualises the input-target setting during training, which is the same setup as for GPT. At training time, the conditional predictions for all positions can be made in parallel because all positions of the ground truth are known; the inference procedure is sequential, as for all the previous AR models. One of the problems of causal convolutions is that they require many layers or large filters to increase the receptive field. The authors apply dilated convolutions to increase the receptive field by orders of magnitude without greatly increasing the computational cost. In each layer the dilation increases by a factor of two (i.e., dilations $1, 2, 4, \dots, 2^{L-1}$); hence, the receptive field is $1 + (k - 1)(2^{L} - 1)$, where $L$ is the number of layers and $k$ is the convolutional filter size. We are the first to apply a CNN architecture to suffix generation in a multi-modal input and multi-task prediction setting; (Weytjens and De Weerdt, 2020) apply a variant of it tailored to process outcome prediction, and the variant of (Pasquadibisceglie et al., 2019) is tailored to next-event prediction.
3.3. Generator
The generator component offers multi-task predictions. For each event in the sequence, categorical and numerical variables are predicted, respectively. For the categorical activity label variable, the readout provides a vector of logits $\mathbf{z}$. The logits are transformed into likelihoods by the softmax function ($\hat{\mathbf{a}}_i = \exp(\mathbf{z}_i) / \sum_j \exp(\mathbf{z}_j)$). There are several methods to decode the discrete token out of this categorical distribution $\hat{\mathbf{a}}$. The most commonly used method is to select the most likely category (Heinrich et al., 2021; Schönig et al., 2018; Tax et al., 2017), which is also called argmax or greedy search. Beam search, a breadth-first search technique, can yield better quality sequences at the cost of increased memory and computation (Taymouri et al., 2021). Different forms of stochastic sampling (from the categorical distribution) can also be used. In our framework we use the most common greedy search solution for now.
The proposed framework provides multi-objective optimisation. Learning for the categorical features is via the categorical cross-entropy loss (i.e., of the ground truth $\mathbf{a}$ and the predicted $\hat{\mathbf{a}}$): $\ell_a = -\sum_i \mathbf{a}_i \log \hat{\mathbf{a}}_i$. The loss function is the average of such errors over all items in the sequence. Learning for the continuous features is via the squared error (i.e., of the ground truth $t$ and the predicted $\hat{t}$): $\ell_t = (t - \hat{t})^2$. The loss function is the average of such errors over all items in the sequence.
The final loss is a weighted sum of the individual losses. During inference the sequence generation runs in a loop; a step is always conditioned on the previously generated context, up to the [EOS] or a predefined maximum length. We set that maximum to the length of the longest trace in the dataset, which is a pragmatic choice.
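The multi-objective loss described above (cross-entropy for the activity label, squared error for the time, each averaged over the sequence and then combined as a weighted sum) can be sketched as follows. The weights `w_act` and `w_time` and all example values are illustrative assumptions; the weights actually used in the experiments are not stated here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Categorical cross-entropy against a one-hot target given by its index."""
    return -math.log(softmax(logits)[target_index])


def multitask_loss(
    logit_seq: List[List[float]],
    label_seq: List[int],
    time_pred: List[float],
    time_true: List[float],
    w_act: float = 1.0,   # assumed weight of the activity loss
    w_time: float = 1.0,  # assumed weight of the time loss
) -> float:
    """Weighted sum of the sequence-averaged cross-entropy and squared error."""
    n = len(label_seq)
    ce = sum(cross_entropy(z, y) for z, y in zip(logit_seq, label_seq)) / n
    se = sum((t - p) ** 2 for t, p in zip(time_true, time_pred)) / n
    return w_act * ce + w_time * se


# Made-up two-step sequence with two activity classes.
loss = multitask_loss(
    logit_seq=[[0.0, 0.0], [0.0, 0.0]],
    label_seq=[0, 1],
    time_pred=[0.5, 1.0],
    time_true=[1.0, 1.0],
    w_time=2.0,
)
```

In this example the uniform logits give a cross-entropy of log 2 at both steps, and the single time error of 0.5 contributes an averaged squared error of 0.125 before weighting.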
4. Experimental Results
4.1. Evaluation Measures
To evaluate the performance of suffix prediction (from the viewpoint of the sequence of categorical activity labels), we use the Damerau-Levenshtein (DL) distance. This metric measures the quality of the predicted suffix of a trace by adding the swapping operation to the set of operations used by the regular Levenshtein distance. Given two activity sequences $s_1$ and $s_2$, we consider the following similarity: $DLS(s_1, s_2) = 1 - \frac{DL(s_1, s_2)}{\max(|s_1|, |s_2|)}$, where $|s|$ is the length of $s$. $DLS \in [0, 1]$, and it is $1$ when two sequences are the same and $0$ when two sequences contain completely different elements. Also, we compute the absolute error between the ground truth remaining time and the predicted remaining time for each pair of predicted and ground truth suffixes. Next, we average these numbers over the evaluation instances and report the Mean Absolute Error (MAE).
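The similarity measure can be sketched in plain Python as follows. The sketch uses the optimal string alignment variant of the Damerau-Levenshtein distance (insertions, deletions, substitutions, and swaps of adjacent elements); whether the experiments use this restricted variant or the unrestricted one is an assumption on our side, and the function names are illustrative.

```python
from typing import Sequence


def damerau_levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Optimal string alignment distance between two activity sequences."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[len(a)][len(b)]


def dl_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """1 - DL(a, b) / max(|a|, |b|); defined as 1.0 for two empty sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - damerau_levenshtein(a, b) / longest


same = dl_similarity(["a", "b", "c"], ["a", "b", "c"])
swapped = dl_similarity(["a", "b", "c"], ["a", "c", "b"])
disjoint = dl_similarity(["a", "b"], ["c", "d"])
```

A predicted suffix that merely swaps two neighbouring activities is penalised by a single edit, whereas a suffix that is too short is penalised once per missing activity.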
4.2. Datasets
We evaluate the performance of the models on real-life datasets, which are very commonly used in the PM community (publicly available at https://data.4tu.nl/search?q=:keyword:%20%22Task%20Force%20on%20Process%20Mining%22), without preprocessing. Fig. 3 shows the case length and activity distributions for all the datasets. The case length distribution of all datasets is heavily skewed. This creates a challenging situation for sequential prediction tasks: the models have to represent the underrepresented long traces in the long tail of the distributions. The embedded activity distribution plots include the added [PAD] special symbol (red bar) and show the extreme relative frequency of [PAD]. We split the traces into training and evaluation sets by an 8:2 ratio after shuffling, i.e., we use 80% of the traces for training and the remaining traces for evaluation. We use the exact same subsets for all experiments.
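The effect of a skewed case length distribution on padding can be quantified with a short calculation. The sketch below computes the share of [PAD] tokens when all traces are padded to the length of the longest one; the trace lengths are made up for illustration and are not taken from the evaluated datasets.

```python
from typing import List


def padding_fraction(lengths: List[int]) -> float:
    """Share of padded positions when every trace is padded to the longest length."""
    longest = max(lengths)
    total_positions = longest * len(lengths)
    return (total_positions - sum(lengths)) / total_positions


# Hypothetical skewed log: nine short traces and one long trace.
skewed = padding_fraction([3] * 9 + [30])
uniform = padding_fraction([5, 5, 5])
```

In this toy log a single long trace forces 243 of the 300 positions to be padding, which illustrates how [PAD] can dominate the activity label distribution.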
4.3. Experimental Setup
The proposed framework is implemented in Python, PyTorch, and CUDA (source code: http://github.com/smartjourneymining/sequentialdeeplearningmodels) and runs on a Linux server with an NVIDIA V100 GPU. Throughout all experiments we keep the number of layers and the latent vector size of the sequential component equal across models, plus an embedding and a generator with the same latent size. We train the models for a fixed maximum number of epochs and apply early stopping if we observe no further improvement on the evaluation set for 50 iterations. We use Adam as the optimization algorithm with a fixed learning rate. To improve generalisation we apply dropout.
4.4. Results
As Table 1 shows, all the models except BERT perform on a comparable level. The reason might be that BERT is more sensitive to the skewness because of its non-AR training objective. Furthermore, masked language modeling (PMLM) might require more training epochs (since it approximates all perturbations) for suffix prediction. AR models perform well on datasets with shorter traces and a smaller vocabulary, such as the Helpdesk and RTFM logs. Those logs are also relatively more structured. BPI17 and Sepsis cases appear to be the most challenging datasets. Together with BPI12, those also have the longest traces.
Fig. 4 visualises the DLS of the suffix generation results for different prefix lengths (x-axis), starting from length two. We visualise the performance change of the models with a line chart (left y-axis) and the frequency of prefixes of a certain length in the data with a bar chart (right y-axis). Intuitively, the DLS should increase with the prefix length, but empirically this is not the case. The bar chart aims to shed light on the underlying phenomenon, which could be connected to the number of traces in the dataset that are at least as long as the given prefix length. Due to the non-uniform trace length distribution, the count of traces decreases monotonically with increasing prefix length. The slope of that decrease is small for shorter prefixes: in that interval the DLS generally increases. That is generally followed by a large drop in the count, possibly due to the skewness, which results in a performance degradation of the models in general. The models tend to be biased towards predicting shorter suffixes because there are far fewer long traces in the dataset. The DLS calculation heavily punishes that bias. However, there are perfect DLS scores for the prefix sizes that are very close to the length of the longest trace(s). That may be the result of overfitting, given the negligible variance among those extremely few traces.
It is interesting that results are missing for some prefix lengths even though these prefixes do exist in the data (bar chart). This is due to the random training-evaluation split: the evaluation subset contains no corresponding traces of the same length. Thus, a sole, average DLS metric does not always seem to be informative enough. For example, for the RTFM dataset GPT has the highest average DLS, but for the prefix length of 12 it is the worst performing model.
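Reporting the score per prefix length next to the overall average is straightforward to implement. The sketch below groups evaluation results by prefix length; the result tuples are made-up numbers chosen only to show how an acceptable average can hide a poor score at one prefix length.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_by_prefix_length(results: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average the scores separately for every prefix length."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for prefix_length, score in results:
        buckets[prefix_length].append(score)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}


# Hypothetical (prefix length, DLS) results of one model on an evaluation set.
results = [(2, 1.0), (2, 0.5), (3, 0.75), (12, 0.0)]
per_length = mean_by_prefix_length(results)
overall = sum(score for _, score in results) / len(results)
```

Prefix lengths that do not occur in the evaluation subset simply have no entry in `per_length`, which corresponds to the gaps visible in the line charts.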
Table 2 shows that all the models perform on a comparable level regarding the average MAE of the time prediction. LSTM, BERT, and WaveNet perform worse, which might be because of the limited receptive field in the case of WaveNet and LSTM. BERT might need more training iterations because of the randomness introduced in its MLM.
Fig. 5 visualises the MAE of the suffix generation results with two charts in an overlay for all prefix lengths (x-axis), starting from length two. The line chart (left y-axis) visualises the performance change of the models over the changing prefix size. The bar chart (right y-axis) plots the number of traces in the dataset that are at least as long as the given prefix length. Due to the non-uniform trace length distribution, the count of traces decreases monotonically with increasing prefix length. The intuition is that the MAE generally decreases with increasing prefix size, since fewer remaining time predictions have to be made. This seems to be empirically supported.
5. Conclusions
We investigated how 7 sequential DL architectures (4 of which have never been applied to suffix generation in PM) perform in predicting the suffix of multi-modal sequences. Evaluating the architectures across all prefix lengths showed that none of them is one-size-fits-all and that, with increasing lengths, the prediction performance for some prefixes fluctuates heavily with substantial drops in performance. These results have implications for how the performance of DL models should be reported. Solely reporting the average DLS for suffix prediction seems to be inadequate for comparing DL architectures and for assessing their suitability in practice. Our hypothesis is that this effect is due to the skewness of the length distribution of traces in event logs, which prevents the direct application of successful DL architectures for NLP tasks (in particular, BERT) in our experimental setting. In future work, the impact of other event log properties, e.g., as proposed in (Heinrich et al., 2021; Kratsch and others, 2021), should also be investigated for suffix prediction. Furthermore, controlled experiments (e.g., on simulated data from process models (Burattin, 2016)) could be carried out to identify the impact of specific data properties. Finally, our experiments should be extended with hyperparameter tuning for each architecture to avoid the influence of manual parameter choices.
Acknowledgement. This work is part of the Smart Journey Mining project, which is funded by the Research Council of Norway (project no. 312198).
References
Bengio et al. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 1171–1179.
Bukhsh et al. (2021). ProcessTransformer: predictive business process monitoring with transformer network. arXiv: 2104.00721.
Burattin (2016). PLG2: multi-perspective process randomization with online and offline simulations. In Proceedings of the BPM Demo Track 2016 Co-located with the 14th International Conference on Business Process Management (BPM 2016), Rio de Janeiro, Brazil, September 21, 2016, L. Azevedo and C. Cabanillas (Eds.), CEUR Workshop Proceedings, Vol. 1789, pp. 1–6.
Jordan (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 531–546.
Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186.
Evermann et al. (2017). Predicting process behaviour using deep learning. Decis. Support Syst. 100, pp. 129–140.
Ghahramani (2015). Probabilistic machine learning and artificial intelligence. Nature 521 (7553), pp. 452–459. ISSN 1476-4687.
Goodfellow et al. (2014). Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27.
Heinrich et al. (2021). Process data properties matter: introducing gated convolutional neural networks (GCNN) and key-value-predict attention networks (KVP) for next event prediction with deep learning. Decis. Support Syst. 143, pp. 113494.
Hinton and Salakhutdinov (2006). Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507.
Hochreiter and Schmidhuber (1997). Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780. ISSN 0899-7667.
Kratsch and others (2021). Machine learning in business process monitoring: a comparison of deep learning and classical approaches used for outcome prediction. BISE 63, pp. 261–276.
Liao et al. (2020). Probabilistically Masked Language Model capable of autoregressive generation in arbitrary word order. In ACL proceedings, pp. 263–274.
Lin et al. (2019). MM-Pred: a deep predictive model for multi-attribute event sequence. In SDM19, pp. 118–126.
Moon et al. (2021). POP-ON: prediction of process using one-way language model based on NLP approach. Applied Sciences 11 (2). ISSN 2076-3417.
Neu et al. (2021). A systematic literature review on state-of-the-art deep learning methods for process prediction.
Pasquadibisceglie et al. (2019). Using convolutional neural networks for predictive process analytics. In 2019 International Conference on Process Mining (ICPM), pp. 129–136.
Radford and Narasimhan (2018). Improving language understanding by generative pre-training.
Rio and others (2020). Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction. Scientific Reports 10 (1), pp. 14634. ISSN 2045-2322.
Schönig et al. (2018). Deep learning process prediction with discrete and continuous data features. In ENASE 2018, pp. 314–319.
Tax et al. (2017). Predictive business process monitoring with LSTM neural networks. In CAiSE 2017, LNCS, Vol. 10253, pp. 477–492.
Taymouri et al. (2021). A deep adversarial model for suffix and remaining time prediction of event sequences. In SDM21, pp. 522–530.
Tsai et al. (2019). Multimodal Transformer for unaligned multimodal language sequences. In ACL proceedings, pp. 6558–6569.
van den Oord and others (2016). WaveNet: a generative model for raw audio. arXiv preprint.
Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.
Weytjens and De Weerdt (2020). Process outcome prediction: CNN vs. LSTM (with attention). In Business Process Management Workshops, A. Del Río Ortega, H. Leopold, and F. M. Santoro (Eds.), Cham, pp. 321–333. ISBN 9783030664985.
Weytjens and De Weerdt (2021). Creating unbiased public benchmark datasets with data leakage prevention for predictive process monitoring. arXiv preprint.
Zadeh et al. (2017). Tensor Fusion Network for multimodal sentiment analysis. In EMNLP, pp. 1103–1114.