1 Introduction
Prescriptive Process Monitoring (PrPM) [5, 9] is a set of techniques to recommend or to trigger actions (herein called interventions) during the execution of a process in order to optimize its performance. PrPM techniques use business process execution logs (a.k.a. event logs) to predict negative outcomes that affect the performance of the process, and use these predictions to determine if and when to trigger interventions to prevent or mitigate such negative outcomes. For example, in a loan origination process, a PrPM technique may trigger interventions such as giving a phone call to a customer, sending them an alternative loan offer, or offering them a discounted insurance premium if they bundle the loan with an insurance contract. These interventions may contribute positively to the probability that the customer will accept a loan offer within a given timeframe.
Several PrPM techniques have been proposed in the literature [5, 9, 3]. These techniques, however, assume that it is possible to trigger any number of interventions at any point in time. In practice, each intervention requires some resources (e.g. time from an employee) and those resources have a limited capacity. For example, in a loan handling process, an intervention could be providing an alternative loan offer to increase the applicant’s chances of taking a loan. This intervention requires a certain amount of time from a loan officer. Thus, it is impossible to trigger it unless a loan officer is available to perform the intervention.
In this setting, this paper addresses the question of whether or not to trigger an intervention during the execution of an instance of a business process (herein called a case) in order to optimize a gain function that takes into account the cost of the case ending in a negative outcome and the cost of the intervention. Relative to previous work in this area, the paper tackles this question in the context where each intervention requires locking a resource for a given treatment duration and where the number of available resources is bounded.
To address this question, the paper uses a predictive modeling approach to estimate the probability of negative case outcomes together with a causal inference approach to estimate the effect of triggering an intervention on the probability of a negative case outcome. Based on these outputs, the gain of triggering an intervention for each ongoing case is estimated, and this estimate is used to determine which cases should be treated given the available resources.
The paper reports on an evaluation on a real-life event log, aimed at comparing the proposed approach with a baseline that relies only on predictive models.
The rest of the paper is structured as follows. Section 2 presents background concepts and related work. Section 3 explains our approach. Then, Section 4 sets up the experiments and evaluates the introduced technique. Finally, Section 5 summarizes this paper and discusses possible future improvements.
2 Background and Related Work
2.1 Predictive Process Monitoring
This paper deals with the problem of triggering interventions in order to minimize the probability of a case ending in a negative outcome. This problem is directly related to that of predicting the probability of negative case outcomes, which is the problem addressed by so-called outcome-oriented Predictive Process Monitoring (PPM) techniques [13]. The core of any outcome-oriented PPM problem is an event log representing the execution of a business process. An event log is a set of complete traces, each consisting of a sequence of events, and each event carries several attributes. Three attributes exist in every event: the case identifier (a unique reference to the process instance in which the event occurs), an activity (what happens), and a timestamp (when the activity occurs).
An extract of a loan handling process is shown in Figure 1 as a running example with two traces. Each trace consists of a sequence of event records (herein called events), wherein each record contains at least three attributes: a case identifier, an activity label (activity), and a timestamp. In other words, each event describes the occurrence of an activity at a specific point in time and belongs to a given case. Other event attributes might exist, such as who performs the activity, i.e., the resource. Additional attributes may be of one of two types: case attributes or event attributes. Case attributes are attributes whose values do not change within a case. For example, in Figure 1, the log contains two case attributes: the age and gender of the client. On the other hand, event attributes are attributes whose value may change from one event to the next within a case. For example, the resource attribute is an event attribute because every event in a trace is likely to be assigned to a different resource.
Outcome-oriented PPM methods predict the outcome of an ongoing case, given its (incomplete) trace. In a typical binary PPM method, the outcome of a case may be positive (e.g. a client accepted the loan offer) or negative (the client did not accept the offer). Accordingly, a precondition for applying a PPM method is to have a notion of case outcome, as well as historical data about case outcomes. In the above example, this means that for each trace we need to know whether or not the customer accepted the loan offer. An event log in which each trace is labeled with a case outcome is called a labeled event log.
PPM methods typically distinguish between an offline training phase and an online prediction phase. In the offline phase, a predictive model (specifically a classification model) is trained based on historical (completed) cases. This model is then used during the online phase to make predictions based on incomplete traces. To train models for PPM, a typical approach is to extract all or a subset of the prefixes of the labeled trace in an event log, and to associate the label of the full trace to every prefix extracted from the trace. A dataset of this form is called a labeled prefix log. A labeled prefix log is a set of prefixes of traces, each one with an associated case outcome (positive or negative).
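The construction of a labeled prefix log can be illustrated with a minimal Python sketch. The function name `build_prefix_log`, the dictionary-based log layout, the `max_len` cut-off, and the activity names "submit" and "offer" are assumptions made for illustration; they are not taken from the implementation accompanying this paper.

```python
from typing import Dict, List, Tuple

Event = Dict[str, object]   # one event record (activity, timestamp, ...)
Trace = List[Event]         # events of one completed case, in order


def build_prefix_log(
    labeled_log: Dict[str, Tuple[Trace, int]], max_len: int
) -> List[Tuple[str, Trace, int]]:
    """Return (case_id, prefix, label) triples for prefixes of length 1..max_len."""
    prefix_log = []
    for case_id, (trace, label) in labeled_log.items():
        for k in range(1, min(len(trace), max_len) + 1):
            # Every prefix inherits the outcome label of its full trace.
            prefix_log.append((case_id, trace[:k], label))
    return prefix_log


# Hypothetical two-case log; label 1 = negative outcome, 0 = positive outcome.
log = {
    "c1": ([{"activity": "submit"}, {"activity": "offer"}, {"activity": "A_Pending"}], 0),
    "c2": ([{"activity": "submit"}, {"activity": "A_Denied"}], 1),
}
prefixes = build_prefix_log(log, max_len=2)
```

Each completed trace contributes one training example per prefix, so a classifier trained on this log sees cases at the same stages of completion that it will encounter online.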
We use the labeled prefix log to train a machine learning algorithm to build a predictive monitoring model. However, we first need to encode each prefix in the prefix log as a so-called feature vector, using functions herein called trace encoders. Teinemaa et al. [12] propose and evaluate several types of trace encoders and find that the aggregation encoder consistently yields models with high accuracy. An aggregate encoder is a function that maps each prefix of a trace to a feature vector. It encodes each case attribute as a feature (one-hot encoding categorical case attributes). For each numerical event attribute, it applies an aggregation method (e.g., sum) over the sequence of values taken by this attribute in the prefix. For every categorical event attribute, it encodes every possible value of that attribute as a numerical feature that holds the number of times this value has appeared in the prefix. An example of applying aggregate encoding is shown in Figure 2.
2.2 Prescriptive Process Monitoring
Prescriptive process monitoring (PrPM) is a family of techniques that play an essential role in optimizing the performance of a business process by triggering interventions at runtime. Recently, several studies in the literature discuss the impact of turning PPM problems into PrPM to improve business processes.
Fahrenkrog et al. [5] introduce an approach to generate single or multiple alarms when the probability of a case leading to an undesired outcome is above a threshold (e.g., 70%). Each alarm triggers an intervention, which reduces the probability of a negative outcome. Their method optimizes the threshold empirically w.r.t a total gain function.
Metzger et al. [9] propose ensemble methods to compute predictions and reliability estimates to optimize the threshold instead of optimizing it empirically. They introduce policy-based reinforcement learning to find and learn when to trigger proactive process adaptation. This work targets the problem of learning when to trigger an intervention, rather than the question of whether or not to trigger an intervention.
Both the technique of Metzger et al. and that of Fahrenkrog et al. work under the assumption that the number of interventions that may be triggered at a given point in time is unbounded. In contrast, the technique proposed in this paper explicitly takes resource constraints into consideration.
Weinzierl et al. [14] propose a PrPM technique to recommend the next activity in each ongoing case of a process, to maximize a given performance measure. This previous study does not consider an explicit notion of intervention, and thus, it takes into account neither the cost of an intervention nor the fact that an intervention may only be triggered if a resource is available to perform it.
2.3 Causal Inference
Causal Inference (CI) [15] is a collection of techniques to discover and quantify cause-effect relations from data. Causal inference techniques have been used in a broad range of domains, including process mining.
In [2], the authors introduce a technique to find guidance rules following a treatment-outcome relation, which improves the business process by triggering an intervention when a condition holds. They generate rules at design time at the level of groups of cases, which are later validated by domain experts. More recently, in [3], they address another target problem, namely reducing the cycle time of a process using interventions so as to maximize a net gain function. Both works [2] and [3] consider the estimation of the treatment effect. However, they assume that interventions with a positive impact occur immediately and do not examine the finite capacity of resources.
Causal inference techniques are categorized into two main frameworks [8]: (1) Structural Causal Models (SCMs), which consist of a causal graph and structural equations [1]. SCMs focus mainly on estimating causal effects through a causal graph that a domain expert constructs manually. (2) Potential outcome frameworks, which focus on learning the treatment effects for a given treatment-outcome set. Our work utilizes the latter, which focuses on automatic estimation methods rather than manually constructed graphs.
We use potential outcome models to estimate the treatment effect, hereafter called the conditional average treatment effect (CATE), from observational data. In particular, we use the orthogonal random forest (ORF) algorithm, which combines tree-based models [1] and double machine learning [4] in one generalized approach [10]. It estimates the CATE on an outcome $Y$ when we apply a treatment $T$ to a given case with features $X$. ORF requires input to be in the form of $\{(T_i, Y_i, W_i, X_i)\}_{i=1}^{n}$ for $n$ instances. For each instance $i$, the treatment is described by a binary variable $T_i \in \{0, 1\}$, where $T_i = 1$ means that the treatment is applied to a case and $T_i = 0$ that it is not. $Y_i$ refers to the observed outcome, $W_i$ describes potential confounding properties, and $X_i$ is the information achieving heterogeneity.
3 Approach
The primary objective of our approach is to determine whether or not to treat a given case and when an intervention takes place to maximize the total gain. To learn whether or not to treat, we build predictive and prescriptive models in the learning phase. Then, the resource allocator selects when to treat.
The approach consists of two main phases, as shown in figure 3. In the learning phase, we prepare the event log to build two different machine learning models. The first one represents the predictive model to predict the undesired outcome of cases. The second one is the causal model to estimate the impact of a given intervention on the outcome of a case. Then in the resource allocator phase, the predicted probability of the negative outcome and the estimated treatment effect are used to determine the net gain.
In the following, we explain each step of the two phases in detail. We start by defining the preprocessing, predictive, and causal models from the first phase. Then we describe the resource allocator that enables the highest total gain.
3.1 Data Preprocessing
To obtain the best performance from both the predictive and the causal model, preprocessing the event log (here, that of a loan application process) is an essential step. In addition to the preprocessing given by [13], we define the outcome of cases based on the end activity. We represent cases that end with "A_Pending" events as a positive outcome, whereas cases that have "A_Denied" or "A_Cancelled" events are adverse outcomes that need intervention. Then, we define the intervention that we could apply to minimize the unsuccessful loan applications based on the winner report of the BPIC challenge [11], which reports that making more offers to clients increases the probability of having "A_Pending" as an end state. Accordingly, we mark cases with only one offer as cases to be treated, i.e., $T = 1$. In contrast, cases with more than one offer should not be treated, i.e., $T = 0$.
3.2 Predictive Model
We build a predictive model to estimate the probability that a case will end with the undesired outcome. We compare the estimated probability against a threshold, which we optimize empirically, to decide whether to proceed to estimating the treatment effect and computing the gain.
In order to build a predictive model as shown in figure 4, we first extract prefixes up to a given length from every trace, which results in a so-called prefix log. This prefix extraction guarantees that our training log is similar to the testing log. For instance, if we have a complete trace containing seven events and we extract prefixes of up to five events, we obtain five incomplete traces, from a trace containing only one event to a trace containing five events. Next, in the aggregate encoding step, we encode each trace prefix into a fixed-size feature vector (see the example in figure 2). Finally, we use the encoded log to train a machine learning method to estimate the probability of the undesired outcome.
This paper deals with an outcome-oriented PPM problem, which is a classification problem from a machine learning perspective. The output of training a classification technique is a predictive model that estimates the probability of the undesired outcome of running cases.
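The aggregate encoding step can be sketched in a few lines of Python. This is an illustrative sketch rather than the encoder used in our implementation: the function name `aggregate_encode`, the attribute names, and the choice of `sum` as the aggregation are assumptions, and case attributes are assumed to be numeric (categorical ones would additionally be one-hot encoded).

```python
from collections import Counter
from typing import Dict, List


def aggregate_encode(
    prefix: List[dict],
    case_attrs: List[str],
    numeric_event_attrs: List[str],
    categorical_event_attrs: Dict[str, List[str]],
) -> Dict[str, float]:
    """Map a trace prefix to a feature dictionary whose size does not depend on its length."""
    features: Dict[str, float] = {}
    # Case attributes are constant within a case: read them from the first event.
    for attr in case_attrs:
        features[attr] = prefix[0][attr]
    # Numerical event attributes: aggregate (here, sum) over the prefix.
    for attr in numeric_event_attrs:
        features[f"sum_{attr}"] = sum(e[attr] for e in prefix)
    # Categorical event attributes: one frequency feature per possible value.
    for attr, values in categorical_event_attrs.items():
        counts = Counter(e[attr] for e in prefix)
        for value in values:
            features[f"{attr}={value}"] = counts.get(value, 0)
    return features


# Hypothetical prefix with two events.
prefix = [
    {"age": 35, "activity": "submit", "resource": "r1", "amount": 100.0},
    {"age": 35, "activity": "offer", "resource": "r1", "amount": 50.0},
]
vocab = {"activity": ["submit", "offer", "call"], "resource": ["r1", "r2"]}
vec = aggregate_encode(prefix, ["age"], ["amount"], vocab)
shorter = aggregate_encode(prefix[:1], ["age"], ["amount"], vocab)
```

Because the set of features is determined by the attribute vocabularies and not by the prefix, prefixes of any length map to vectors of the same size, which is what a standard classifier requires.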
3.3 Causal Model
We use ORF to build a causal model to estimate the treatment effect, i.e., the CATE, of an intervention in a given case. Using ORF in causal process mining has several benefits compared to other causal estimation techniques. By nature, event logs have many event attributes with categorical and resource features that may lead to feature explosion. ORF implements nonparametric estimation for the objective parameter, i.e., outcome. Moreover, ORF copes well with high-dimensional confounding variables, which is the situation in our problem.
To estimate the CATE using ORF, input needs to be in the form of $\{(T_i, Y_i, W_i, X_i)\}_{i=1}^{n}$ for $n$ instances. For each instance $i$, $T_i$ is the accepted treatment, $Y_i$ refers to the observed outcome, $W_i$ describes the potential confounding variables, and $X_i$ is the information achieving heterogeneity. In this work, we deal with an outcome-oriented loan application process; that is, the purpose is to increase the rate of successful loan applications by treating ongoing applications. We hypothesize that the intervention increases the number of successful applications, and we assume that the treatment is identified beforehand. $X$ and $W$ are obtained from the encoded log, and we assume that all log attributes $X$ are also possible confounders $W$. Nevertheless, $X$ and $W$ need not be the same variables: a domain expert can specify which features should be removed from $W$ if they do not affect the outcome.
Next, based on the above descriptions, we train an ORF to estimate the treatment effect. The output of training an ORF is a causal model used to estimate the CATE for running cases.
3.4 Resource Allocator
We trained two models in the learning phase: the predictive model to estimate the probability that a case will end with the undesired outcome and the causal model to estimate the CATE of applying an intervention to a given case. We use both models with the resource allocator to decide whether or not to treat a given case and when the intervention takes place to maximize the total gain.
Triggering interventions in cases may bring a gain; however, it comes at a cost. Therefore, to define the total gain, we determine the costs with and without intervention if the predictive model gives a probability higher than a specific threshold. In particular, if the intervention is expensive relative to the benefit it could provide, it becomes more critical to decide whether or not to treat a given case.
A suitable threshold is not known beforehand. One solution is to optimize the threshold empirically to obtain maximal gain instead of using an arbitrary fixed value. The threshold is used to ensure that a given case has a high probability of ending with the undesired outcome, i.e., that its estimated probability exceeds the threshold.
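Definition 1
Cost with no intervention. The cost when a case $c$ ends with an undesired outcome without applying the intervention, $\mathit{cost}_0(c)$, is shown in equation 1. $P_{\mathit{uout}}(c)$ is the estimated probability of the undesired outcome from the predictive model, and $c_{\mathit{uout}}$ is the cost of the undesired outcome.
$$\mathit{cost}_0(c) = P_{\mathit{uout}}(c) \cdot c_{\mathit{uout}} \tag{1}$$
Definition 2
Cost with intervention. The cost when a case $c$ ends with an undesired outcome despite applying the intervention $T$, $\mathit{cost}_1(c)$, is shown in equation 2. $\mathit{CATE}(c)$ is the estimated causal effect of applying $T$ to $c$, resulting from the ORF model, and $c_T$ is the cost of applying $T$ to $c$.
$$\mathit{cost}_1(c) = \bigl(P_{\mathit{uout}}(c) - \mathit{CATE}(c)\bigr) \cdot c_{\mathit{uout}} + c_T \tag{2}$$
We now have the costs with ($\mathit{cost}_1(c)$) and without ($\mathit{cost}_0(c)$) the intervention, the estimated probability ($P_{\mathit{uout}}(c)$), and $\mathit{CATE}(c)$. The next step is defining the gain from applying $T$ to $c$ that enables the highest cost reduction based on equations 1 and 2, as shown in equation 3. The gain decides whether or not to treat $c$, which solves the first part of our problem.
Definition 3
Gain. The gain from applying $T$ to $c$ is the cost reduction that the intervention achieves.
$$\mathit{gain}(c) = \mathit{cost}_0(c) - \mathit{cost}_1(c) \tag{3}$$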
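Substituting the two cost definitions into the gain makes the trade-off explicit. The derivation below is a sketch under the assumption, consistent with the values in Table 1, that the cost without intervention is the probability of the undesired outcome times its cost, and that the intervention lowers this probability by the CATE while adding the intervention cost. The symbols $P_{\mathit{uout}}(c)$, $c_{\mathit{uout}}$, $c_T$, and $\mathit{CATE}(c)$ are notation chosen here for these quantities.

```latex
\begin{align*}
\mathit{gain}(c) &= \mathit{cost}_0(c) - \mathit{cost}_1(c) \\
  &= P_{\mathit{uout}}(c)\, c_{\mathit{uout}}
     - \bigl[\bigl(P_{\mathit{uout}}(c) - \mathit{CATE}(c)\bigr)\, c_{\mathit{uout}} + c_T\bigr] \\
  &= \mathit{CATE}(c)\, c_{\mathit{uout}} - c_T .
\end{align*}
```

Under these assumptions, the gain is positive exactly when $\mathit{CATE}(c) > c_T / c_{\mathit{uout}}$, so the predicted probability acts only as a filter on which cases are considered, while the ranking among candidate cases is driven by the estimated treatment effect.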
For example, suppose we have an event log with six cases (see table 1), a cost of the undesired outcome of 20, and an intervention cost of 1. In two situations we do not calculate the costs with and without intervention and, therefore, the gain. The first is case C, where the estimated probability is below the threshold. The other is case F, where applying the intervention has no positive effect on the case, even though its estimated probability is above the threshold. The other cases fulfill both conditions: an estimated probability above the threshold and a positive estimated treatment effect.
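Case  P(undesired)  Cost undesired  Cost interv.  CATE  Cost w/o interv.  Cost with interv.  Gain
A     0.55          20              1             0.3   11                6                  5
B     0.64          20              1             0.12  12.8              11.4               1.4
C     0.4           20              1             n/a   n/a               n/a                n/a
D     0.8           20              1             0.13  16                14.4               1.6
E     0.9           20              1             0.22  18                14.6               3.4
F     0.51          20              1             -1.2  n/a               n/a                n/a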
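The arithmetic behind Table 1 can be reproduced with a short Python sketch. The function name `gain`, the threshold value of 0.5, and the negative sign of the CATE of case F (the text states that the intervention has no positive effect on it) are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def gain(p_uout: float, cate: Optional[float], c_uout: float, c_t: float,
         threshold: float) -> Optional[float]:
    """Return the gain of treating a case, or None if the case is not a candidate."""
    if p_uout <= threshold:
        return None                      # unlikely to end badly: no estimate needed
    if cate is None or cate <= 0:
        return None                      # intervention has no positive effect
    cost_without = p_uout * c_uout                   # equation (1)
    cost_with = (p_uout - cate) * c_uout + c_t       # equation (2)
    return cost_without - cost_with                  # equation (3)


# (probability of undesired outcome, CATE) per case, as in Table 1.
cases: Dict[str, Tuple[float, Optional[float]]] = {
    "A": (0.55, 0.30), "B": (0.64, 0.12), "C": (0.40, None),
    "D": (0.80, 0.13), "E": (0.90, 0.22), "F": (0.51, -1.2),
}
gains = {
    name: gain(p, cate, c_uout=20.0, c_t=1.0, threshold=0.5)   # assumed threshold
    for name, (p, cate) in cases.items()
}
ranked = sorted((n for n, g in gains.items() if g is not None),
                key=lambda n: gains[n], reverse=True)
```

Sorting the candidate cases by gain yields the priority order that the resource allocator uses when resources are scarce.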
The second part of the problem is deciding when to treat a given case, assuming that the intervention fulfills the required conditions, i.e., the estimated probability of the undesired outcome exceeds the threshold and the estimated treatment effect is positive. We use the resource allocator to tackle this part.
The resource allocator monitors the availability of resources to allocate them efficiently. Allocating a resource to a case raises another question: for how long, i.e., for what treatment duration, the allocated resource is blocked to apply the intervention.
A simple way to define the treatment duration is to set it to a fixed value based on domain knowledge. However, the variability of the treatment duration might affect the net gain; therefore, we examine three different distributions for it, i.e., fixed, normal, and exponential.
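The three treatment duration distributions can be sampled with the Python standard library, as in the sketch below. The function name `sample_duration`, the mean of 60 seconds, the standard deviation of 15 seconds, and the truncation of negative normal samples at zero are assumptions for illustration and do not reflect the experimental configuration.

```python
import random
import statistics
from typing import Optional


def sample_duration(kind: str, mean: float, std: float = 0.0,
                    rng: Optional[random.Random] = None) -> float:
    """Draw one treatment duration (in seconds) from the named distribution."""
    rng = rng or random.Random()
    if kind == "fixed":
        return mean
    if kind == "normal":
        return max(0.0, rng.gauss(mean, std))   # a duration cannot be negative
    if kind == "exponential":
        return rng.expovariate(1.0 / mean)      # rate = 1 / mean
    raise ValueError(f"unknown distribution: {kind}")


rng = random.Random(42)                         # fixed seed for repeatability
samples = {
    kind: [sample_duration(kind, 60.0, 15.0, rng) for _ in range(1000)]
    for kind in ("fixed", "normal", "exponential")
}
spread = {kind: statistics.pstdev(values) for kind, values in samples.items()}
```

With the same mean, the fixed distribution has no spread, whereas the exponential distribution has a standard deviation equal to its mean, so the three settings cover increasing levels of variability in how long a resource stays blocked.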
Finally, based on the domain knowledge that tells us how many resources are available to apply the intervention, we keep a list of running cases ordered by their gain. Once a resource is available, we allocate it to apply the intervention to the case with the maximum gain in our ordered list and block it for the treatment duration.
For example, in table 1, suppose two resources are available. First, we allocate one to case A and the other to case B and block them for the treatment duration. Then, case D enters, but we cannot treat it since there are no available resources. Accordingly, we keep D and E (which arrives later) on our sorted list and wait for available resources. Once we have an available resource, we allocate it first to E because it has the maximum gain, then to D.
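The allocation procedure in this example can be sketched as a small event-driven simulation. The function name `allocate`, the arrival times, and the fixed duration of 10 time units are hypothetical; the sketch assumes that only cases that already satisfy the treatment conditions are passed in.

```python
import heapq
import math
from typing import Callable, List, Tuple


def allocate(
    candidates: List[Tuple[float, str, float]],   # (arrival time, case id, gain)
    n_resources: int,
    duration: Callable[[], float],                # samples one treatment duration
) -> List[Tuple[str, float]]:
    """Return (case id, treatment start time) pairs in the order treatments start."""
    arrivals = sorted(candidates)
    releases: List[float] = []                    # min-heap of resource release times
    waiting: List[Tuple[float, float, str]] = []  # max-heap on gain via negation
    schedule: List[Tuple[str, float]] = []
    free, i = n_resources, 0
    while i < len(arrivals) or (waiting and releases):
        # Jump to the next event: a case arrival or a resource release.
        next_arrival = arrivals[i][0] if i < len(arrivals) else math.inf
        next_release = releases[0] if releases else math.inf
        now = min(next_arrival, next_release)
        while releases and releases[0] <= now:              # finished treatments
            heapq.heappop(releases)
            free += 1
        while i < len(arrivals) and arrivals[i][0] <= now:  # newly arrived cases
            arrived_at, case_id, case_gain = arrivals[i]
            heapq.heappush(waiting, (-case_gain, arrived_at, case_id))
            i += 1
        while waiting and free > 0:                         # highest gain first
            _, _, case_id = heapq.heappop(waiting)
            schedule.append((case_id, now))
            heapq.heappush(releases, now + duration())
            free -= 1
    return schedule


# Candidate cases of Table 1 with hypothetical arrival times.
candidates = [(0.0, "A", 5.0), (1.0, "B", 1.4), (2.0, "D", 1.6), (3.0, "E", 3.4)]
schedule = allocate(candidates, n_resources=2, duration=lambda: 10.0)
```

In this run, A and B are treated on arrival, D and E wait, and the first released resource goes to E because its gain is higher than that of D. Cases that are still waiting when no resource will ever be released remain untreated.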
4 Evaluation
In this part, we describe the empirical evaluation of the introduced approach. Mainly, our evaluation discusses the following research questions:
RQ1. To what extent does the total gain depend on the number of available resources?
RQ2. To what extent does the total gain depend on the variability of the treatment duration?
RQ3. What is the total gain when allocating resources to cases with higher gain versus cases with higher undesired outcome probability?
In the following, we first present in Section 4.1 the real-life event log employed in our evaluation. Then we explain the experimental setup in Section 4.2. Finally, in Section 4.3, we present the results for the above research questions in more detail.
4.1 Dataset
We use one real-life event log, namely BPIC2017, corresponding to a loan origination process, to evaluate our approach (available at https://doi.org/10.4121/uuid:5f3067dff10b45dab98b86ae4c7a310b). In this event log, each case corresponds to a loan application. Each application has an outcome. The desired one occurs when the bank offers the client a loan and the client accepts and signs it, while the undesired one occurs when the bank cancels the application or the client rejects the offer. The log contains applications and events.
We used all attributes that exist in the log as input to the predictive and causal models. Furthermore, we extracted other features, e.g., the number of offers and the event number, and temporal information, e.g., the hour of the day, day of the month, and month. We extracted prefixes up to a percentile of all application lengths to avoid bias from long cases and to stop before predicting the outcome of an application is no longer useful. We encoded the extracted prefixes using aggregate encoding to convert them into fixed-size feature vectors.
4.2 Experiment setup
We used Python to implement our approach (see figure 3). For the predictive model, we utilized XGBoost (https://github.com/dmlc/xgboost) to estimate the probability of the undesired outcome. XGBoost has shown promising results on different classification problems [6], [7]. On the other hand, we used the ORF implementation in the EconML package (https://github.com/microsoft/EconML) to estimate the CATE. EconML is a Python package that uses the strength of machine learning methods to estimate the causal effects of utilizing interventions from observational data.
Predictive model  Learning rate  Subsample      Max tree depth  Colsample bytree  Min child weight
XGBoost           0.2            0.89           14              0.54              3
Causal model      # trees        Min leaf size  Max depth       Subsample ratio   Lambda reg
ORF               200            50             20              0.4               0.01
The predictive and causal models follow the same workflow as any machine learning problem. To tune and evaluate these models, we split the log temporally into three parts to simulate real-life situations. Specifically, we order cases by their timestamps, use the earliest cases for training and tuning, and use the rest to evaluate model performance. Table 2 shows the training parameter settings for each model, while table 3 shows the configurations of the proposed approach.
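A temporal split of this kind can be sketched as follows. The function name `temporal_split`, the use of each case's start time as the ordering key, and the fractions in the example are assumptions for illustration; they are not the split ratios used in the experiments.

```python
from typing import Dict, List, Tuple


def temporal_split(
    case_start: Dict[str, float], train_frac: float, tune_frac: float
) -> Tuple[List[str], List[str], List[str]]:
    """Order cases by start time and cut them into train, tune, and test parts."""
    ordered = sorted(case_start, key=case_start.get)   # earliest cases first
    n_train = int(len(ordered) * train_frac)
    n_tune = int(len(ordered) * tune_frac)
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_tune],
        ordered[n_train + n_tune:],                    # most recent cases
    )


# Hypothetical start times (e.g., seconds since the first case in the log).
starts = {"c3": 30.0, "c0": 0.0, "c5": 50.0, "c1": 10.0,
          "c7": 70.0, "c2": 20.0, "c6": 60.0, "c4": 40.0}
train, tune, test = temporal_split(starts, train_frac=0.5, tune_frac=0.25)
```

Splitting by time rather than at random keeps every test case later than every training case, which mirrors how a deployed model only ever sees cases that start after it was trained.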
We present the configuration parameters we follow in our experiments in table 3. We vary the cost of the undesired outcome so that it is larger than the cost of the intervention, in a way that gives a meaningful result. We found that the higher the cost of the undesired outcome relative to the cost of the intervention, the higher the net gain. Accordingly, we applied the highest value of the cost of the undesired outcome in our experiments with different treatment duration distributions and an empirically optimized threshold to answer our research questions.
We compare our approach to a purely predictive baseline proposed in [5], where interventions are triggered as soon as the estimated probability of the undesired outcome exceeds the threshold. In other words, we allocate resources to cases with the highest estimated probability of the undesired outcome instead of cases with the maximum gain, and we treat this probability as the gain achieved from treating cases.
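Table 3: Configurations of the proposed approach. The treatment duration (in seconds) follows a fixed, a normal, or an exponential distribution.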
4.3 Results
We present the results of our proposed approach by exploring the effects of available resources on the total gain and the percentage of treated cases, taking into account the variability of the treatment duration (RQ1 and RQ2). Figure 4(a) shows how the total gain and percentage of treated cases evolve as we increase the number of available resources (RQ1). When the number of available resources increases, both metrics increase. Moreover, once more than half of the resources are available, the total gain increases almost exponentially, because more cases are treated.
Moving to RQ2, we experiment with three distributions of the treatment duration, i.e., fixed, normal, and exponential. Figure 4(a) shows that the fixed distribution gives more net gain because it introduces less variability in the distribution of resources among cases that need intervention than the normal and exponential distributions. Accordingly, the net gain highly depends on the variability of the treatment duration.
To answer RQ3, we allocate resources to cases with the highest estimated probability of the undesired outcome instead of cases with the maximum gain. We treat this probability as the gain achieved from treating cases. Therefore, we need a threshold to determine whether or not to intervene depending on this probability. There are two approaches to set a threshold. The first is to use a given threshold: if there are available resources and the probability of the undesired outcome is above the given threshold, we trigger an intervention. The second is to use an empirical threshold as proposed by [5], where the authors compute an optimal threshold based on historical data. We varied the threshold as shown in table 3. The results differ based on the distribution: the threshold that yields the highest net gain under the normal distribution is not the one that yields the highest net gain under the exponential or the fixed distribution.
We observe that our approach consistently leads to higher net gain, under the same amount of consumed resources, than the purely predictive baseline. For example, under a fixed distribution, the predictive method (Figure 4(b)) treats twice as many cases as our approach (cf. Figure 4(a)) yet yields a lower net gain. This suggests that the combination of causal inference with predictive modeling can enhance the efficiency of prescriptive process monitoring methods.
5 Conclusion
We introduced a prescriptive monitoring approach that triggers interventions in ongoing cases of a process to maximize a net gain function under limited resources. The approach combines a predictive model to identify cases that are likely to end in a negative outcome (and hence create a cost) with a causal model to determine which cases would most benefit from an intervention in their current state. These two models are embedded into an allocation procedure that allocates resources to case interventions based on their estimated net gain.
A preliminary evaluation of the approach suggests that our approach treats fewer cases and allocates resources more effectively, relative to a baseline method that relies only on a predictive model, as suggested in previous work.
In the proposed approach, an intervention is triggered on a case whenever the estimated net gain of treating this case is maximal, relative to other cases. Under some circumstances, this may lead to treating a case at a suboptimal time. For example, in a loan origination process, calling a customer two days after sending an offer may be more effective than doing so just one day after the offer. Our approach would trigger the intervention “call customer” one day after the offer if it turns out that the expected benefit is positive and there is no other case with a higher net gain. An alternative approach would be to allocate resources based both on the estimated net gain of a case intervention at the current time, and the expected gain of intervening in the same case at a future time. An avenue for future work is to combine the proposed method with a method that optimizes the point in time when an intervention is triggered for a given case.
A related direction for future work is to take into account constraints on the moment in time when interventions can be triggered on a case. For example, calling a customer to follow up on a loan offer does not make sense if the loan offer has been canceled or the customer has not received a loan offer.
Another limitation of the proposed approach is that it assumes that there is a single type of intervention. In reality, there may be multiple possible types of interventions (e.g. call the customer, send a second loan offer, offer a bundled product). Another possible future work direction is to extend the proposed approach to handle multiple types of interventions, particularly when such interventions require resources from a common resource pool.
Reproducibility. The implementation and source code of our approach can be found at https://github.com/mshoush/PrescriptiveProcessMonitoring.
References
[1] (2019) Generalized random forests. The Annals of Statistics 47 (2), pp. 1148–1178.
[2] (2020) Process mining meets causal machine learning: discovering causal rules from event logs. In ICPM, pp. 129–136.
[3] (2021) Prescriptive process monitoring for cost-aware cycle time reduction. ICPM.
[4] (2018) Double/debiased machine learning for treatment and structural parameters. Oxford University Press Oxford, UK.
[5] (2019) Fire now, fire later: alarm-based systems for prescriptive process monitoring. arXiv preprint arXiv:1905.09568.
[6] (2014) Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15 (1), pp. 3133–3181.
[7] (2014) Prescriptive analytics for recommendation-based business process optimization. In BIS, pp. 25–37.
[8] (2020) A survey of learning causality with data: problems and methods. ACM Comput. Surv. 53 (4), pp. 1–37.
[9] (2020) Triggering proactive business process adaptations via online reinforcement learning. In BPM, pp. 273–290.
[10] (2019) Orthogonal random forest for causal inference. In ICML, pp. 4932–4941.
[11] (2017) BPIC 2017: density analysis of the interaction with clients. BPI Challenge.
[12] (2018) Temporal stability in predictive process monitoring. Data Min. Knowl. Discov. 32 (5), pp. 1306–1338.
[13] (2019) Outcome-oriented predictive process monitoring: review and benchmark. ACM TKDD 13 (2), pp. 1–57.
[14] (2020) Prescriptive business process monitoring for recommending next best actions. In BPM Forum, Vol. 392, pp. 193–209.
[15] (2020) Causality learning: a new perspective for interpretable machine learning. arXiv preprint arXiv:2006.16789.