
What if Process Predictions are not followed by Good Recommendations?

Process-aware Recommender systems (PAR systems) are information systems that monitor process executions, predict their outcomes, and recommend effective interventions to reduce the risk of failure. This paper discusses monitoring, predicting, and recommending with a PAR system within a financial institute in the Netherlands to avoid faulty executions. While the predictions were based on the analysis of historical data, the most opportune intervention was selected on the basis of human judgment and subjective opinions. The results showed that, while the predictions of risky cases were relatively accurate, no reduction was observed in the number of faulty executions. We believe that this was caused by incorrect choices of interventions. While a large body of research exists on monitoring and predicting based on facts recorded in historical data, research on fact-based interventions is relatively limited. This paper reports on the lessons learned from the case study in finance and proposes a new methodology to improve the performance of PAR systems. This methodology advocates several cycles of interaction among all actors involved, so as to develop interventions that incorporate their feedback and are grounded in insights from factual, historical data.



1 Introduction

Process-aware Recommender systems (hereafter shortened as PAR systems) are a new breed of information systems. They aim to predict how the executions of process instances will evolve in the future, in order to determine those that have a higher chance of not meeting desired levels of performance (e.g., costs, deadlines, customer satisfaction). Recommendations are then provided on which effective contingency actions should be enacted to try to recover from risky executions. PAR systems are expert systems that run in the background and continuously monitor the execution of processes, predict their future, and, possibly, provide recommendations. A substantial body of research exists on evaluating risks (note that, in fact, overspending is a special type of risk and, hence, cost evaluations are special types of risk evaluation), also known as process monitoring and prediction; see, e.g., publications [4, 5, 8, 12, 14, 15, 17] and the survey [18]. Yet, as also indicated in [10], “existing works on interventions, i.e. mitigating [actions], are rare”. In fact, it has often been overlooked how process participants would use these predictions to enact appropriate actions to recover from those executions that have a higher risk of causing problems. One of the few research works that addresses this is [4]. Process participants seem to be tacitly assumed to take the “right decision” on the most appropriate corrective action for each case. This also holds for approaches based on mitigation / flexibility “by design” [10] or on operational research [3]. Unfortunately, the assumption that an effective corrective action is selected is not always met in reality. Interventions are mainly selected based on human judgment, which naturally relies on a subjective perception of the process instead of on objective facts.

We stress here the importance of also building interventions on facts: the system should reason on past process executions to correlate alternative corrective actions with their likelihood of being effective; it should then recommend the action that is most likely to decrease the risk. Otherwise, even if the monitor is able to draw the attention of process participants to those executions that actually require support, the recommender system is ultimately destined to fail. Correctly monitoring a process and making an accurate prediction can be nullified by an improper recovery or intervention. An organization will only profit from using a recommender system if (a) the system is capable of making accurate predictions and (b) the organization is capable of making effective decisions on that basis. Much attention is being paid to (a), specifically to the proper use of data, measuring accuracy, etc. In this work, we show that the analysis of (b) is just as important. Both parts are essential ingredients of an overall solution.

In order to support our arguments, this paper reports on a field experiment that we conducted within UWV, the Dutch governmental agency. Among other things, UWV provides financial support to Dutch residents who lose their job and seek new employment. Several subjects (hereafter often referred to as customers) receive more unemployment benefits than the amount they are entitled to. While this is eventually detected, it may take several months. In UWV’s terminology, a reclamation is created when this happens. Reclaiming the amount of unlawfully provided support is very hard, time-consuming, and often unsuccessful. In this context, an effective recommender system should be able to detect the customers who are most likely to get a reclamation and provide operational support to prevent the provision of benefits without entitlement. Research at UWV has shown that the main causes for reclamations can be attributed to customers making mistakes when informing UWV about the income they receive next to their benefits.

To follow up on this idea, we developed a prediction system that relies on machine-learning techniques to monitor and identify the subjects who are more likely to receive unlawful support. Next, various possible interventions to prevent reclamations were considered by UWV’s stakeholders. The intervention that was selected to be tested in a field experiment consists of sending a specific email to the subjects who were suspected of being at higher risk.

The results show that risky customers were detected rather well, but no significant reduction in the number of reclamations was observed: the intervention was not effective in preventing reclamations. Our findings show the importance of conducting research not only on predictions but also on interventions, to ensure that a PAR system indeed achieves the improvements it aims for.

The rest of this paper is organized as follows. Section 2 discusses the research method that was used. Section 3 introduces the case study at UWV. Section 4 addresses building the PAR system used in the case study. In Section 5 we go into the design, execution and results of the field experiment. Section 6 discusses these results and formulates the lessons learned. Section 7 contains the conclusion and the next steps.

2 Research Methodology

Our research approach is based on the principle that researchers and practitioners work in concert. The practitioners’ involvement focuses on improving the process, while the researchers investigate how the improvement process itself is executed.

Figure 1: Overview of the steps that make up the research method. These steps correspond to one improvement cycle. The “I” is used as an abbreviation for “Intervention”.

Figure 1 illustrates our approach to our study on the development of a PAR system for UWV. All the steps of this approach are executed by both practitioners and researchers together. The first steps (Step 1a and 1b) of the approach are to analyze and identify the organizational issue. Section 3 describes the organizational issue at UWV, which, as mentioned above, is related to reclamations.

The second step is to develop a recommender system, which consists of a predictor module (Step 2a) and a set of interventions (Step 2b). The predictor module is needed to identify the cases on which the interventions should be applied, namely the cases with the highest risk of incurring reclamations. Together with the predictor module, an appropriate set of interventions needs to be selected. Interventions need to be determined in concert with the stakeholders at UWV: only by doing this together can interventions be identified that have the stakeholders’ support. This support is also needed to obtain backing for the changes necessary to implement the interventions in the process. Section 4.1 and Section 4.2 respectively describe how the predictor was built and how the interventions were collected.

At UWV several alternatives were put forward, from which one was chosen (Step 3). Only one intervention could be selected, due to the limited availability of resources at UWV to execute an experiment. With a predictor module and an intervention in place, the next step is to design a field experiment (Step 4). The field experiment was set up as an A/B test [9], in which one or more interventions are tested under the same conditions to find the alternative that best delivers the desired effect. In our field experiment, the combination of a risk level and the intervention can be tested in the natural setting of the process environment. The objective of the field experiment is to determine the effect of applying an intervention to cases at a specific risk level, with respect to the process metric of interest, i.e. whether or not a customer gets a reclamation. All other factors that can play a role in the field experiment are controlled, as far as this is possible in our business environment. Especially when seasonal effects are present, a control group is needed to distinguish seasonal effects from experimental effects. Under these conditions, the field experiment shows whether a causal relation exists between the intervention and the change in the values of the process metrics. Section 5.1 describes the setup for the UWV study.

The results of the field experiment are analyzed to determine whether an effect of applying the intervention can be detected (Step 5). The desired effect is a reduction in the number of customers with a reclamation. Section 5.2 and Section 5.3 contain respectively the analysis of the intervention and of the predictor module. If the intervention is found to have an effect, then both the direction of the effect, i.e. whether the intervention leads to better or worse performance, and the size of the effect need to be calculated from the data. When an intervention has the desired effect, it can be selected to become a regular part of the process; the intervention then needs to be implemented in the process (Step 6). The interventions, together with the predictor module from Step 2a, make up the PAR system. After the decision to implement an intervention, it is necessary to update the predictor part of the PAR system: changing the process implies that the situation under which the predictions are made has changed. Some period of time after the change takes effect needs to be reserved to gather a new set of historic process data on which the predictor component can be retrained.

The final step (Step 7) is the reflective phase, in which the lessons learned from the execution of the approach are discussed. Within this research method, many choices need to be made, for example, which organizational issue will be tackled and which interventions will be tested. Prior to making a choice, the research participants should be aware of any assumptions or biases that could influence their choices. However, nobody is free of bias. Only by reflecting on the outcome of the choices that were made can the practitioners and researchers overcome their preconceptions. Section 6 contains the discussion and lessons learned for the UWV case.

3 The Unemployment-Benefits Process at UWV

UWV is the social security institute of the Netherlands and is responsible for executing a number of employee-related insurance schemes. One of the processes that UWV executes is the unemployment benefits process. When residents in the Netherlands become unemployed, they need to file a request at UWV, which then decides if they are entitled to benefits. When requests are accepted, the customers receive monthly benefits until they find a new job or until the maximum period of their entitlement is reached.

Figure 2: An example scenario of the potential activities that are related to the provision of the unemployment benefits for a customer for the months June, July and August (the year is irrelevant). Each row is related to the activities needed to handle an income form for the month of the benefits. Each benefits month takes several calendar months to be handled, e.g., the benefits for the month of June are handled from June until August.

The unemployment benefit payment process is bound by legal rules. Customers and employees of UWV are required to perform certain steps for each specific month (hereafter income month) in which customers have an entitlement. Figure 2 depicts a typical scenario of a customer who receives benefits, with the steps that are executed in each calendar month. Before a customer receives a payment of benefits for an income month, an income form has to be sent to UWV. Through this form, customers specify whether or not they received any kind of income next to their benefits and, if so, what amount. The benefits can be adjusted monthly as a function of any potential income, down to receiving no benefits at all if the income exceeds the amount of benefits to which the customer is entitled.

Fig. 2 clearly shows that, in October, when the reclamation is handled, two months of unemployment benefits have already been paid, possibly erroneously. While this seems a limited amount (usually a few hundred Euros) if one looks at a single customer, it should be realized that this needs to be multiplied by tens of thousands of customers in the same situation.

UWV has on average 300,000 customers with unemployment benefits, of whom an average of 4% incurs a reclamation each month. Since reclamations are caused by customers filling in income forms incorrectly, the only thing that UWV can do is try to prevent customers from making mistakes when filling in the income form. Unfortunately, targeting all customers with unemployment benefits every month to prevent reclamations would be very expensive. Furthermore, UWV wants to limit communications to customers to only the necessary contact moments; otherwise, communication fatigue can set in, causing important messages from UWV to have less impact on the customers. Targeting only customers with a high chance of getting a reclamation reduces costs and should not reduce the effectiveness of UWV’s messages. For all these reasons, a recommender system that could effectively identify customers with a high risk of getting a reclamation would be very helpful for UWV. Such a recommender system needs to be able to target risky customers and propose opportune interventions.

4 Development of the PAR System for UWV

In line with Step 2a and Step 2b of the approach described in Section 2, we developed a PAR system for the UWV case, i.e. a predictor module and interventions. The basic foundation of any recommender system is the ability to predict which cases are likely to incur problems. Once a case is predicted to be “risky”, an intervention needs to be executed. While the implementation of the predictor component is specialized for the UWV case, the solution is highly generic. Section 4.1 discusses the implementation of the prediction part and how it was customized for the UWV case. Section 4.2 illustrates how the choice of interventions was made.

4.1 Building the Prediction Component

The prediction is based on training a predictor component on historical data. This component was implemented as a stand-alone application in Python and leveraged the scikit-learn library [13] for the data-mining functionality. For the UWV case, the historical data was extracted from the company’s systems. It relates to the execution of every activity for more than 73,000 customers whose reception of unemployment benefits concluded in the period from July 2015 until July 2017. Note that these customers, who were used to train the predictor, differ from the customers on whom the predictor was applied during the experiment.

Figure 3: An example of an event-log fragment for two of UWV’s customers. Each row refers to an event; events with the same Customer ID are grouped into traces and ordered by Event date.

The collected information can be represented in a tabular form, of which an excerpt is presented in Figure 3. Each row in Fig. 3 corresponds to an event, namely the execution of an activity at a certain moment in time and refers to a customer with an identifier and other given characteristics. The table forms a classical event log [1]. Events referring to the same customer can be grouped by customer id and ordered by timestamp, thus obtaining a trace. For the UWV case study, the event log contained 5 million events (i.e. rows) for 73,153 customers, i.e. the event log contained 73,153 traces. Every trace refers to a complete execution of the process to provide benefits, which can end with finding a job or with reaching the end of the maximum benefit provision. A trace can contain zero, one, or more reclamation events.
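As an illustration, grouping such an event log into traces can be sketched in a few lines of pandas; the column names and values below are hypothetical, mirroring the excerpt in Fig. 3 rather than UWV's actual data:

```python
# Sketch: turn a flat event log into traces, i.e. per-customer sequences of
# activities ordered by event date (hypothetical column names and data).
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 2, 1, 2, 1],
    "activity": ["Initialize the Income Form", "Initialize the Income Form",
                 "Send Payment", "Block Benefits", "Handle Reclamation"],
    "event_date": pd.to_datetime(
        ["2016-06-01", "2016-06-02", "2016-07-28", "2016-07-05", "2016-10-11"]),
})

# Each trace is the chronologically ordered list of activities of one customer.
traces = (events.sort_values("event_date")
                .groupby("customer_id")["activity"]
                .apply(list))
print(traces.loc[1])
```

The grouping relies on pandas preserving row order within groups, so sorting by date before grouping yields correctly ordered traces.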

The classifier of the prediction module is trained using the traces of the UWV event log as input. Similarly to what is proposed in [17, 18], each trace σ = ⟨e1, …, en⟩ is encoded into a vector of variables that contains:

  1. the number of executions of each process activity in σ (one numeric variable per activity);

  2. the number of months for which the unemployment benefit can maximally be given (one numeric variable);

  3. the duration of the process execution in terms of the number of months, i.e. the number of months between e1 and en (one numeric variable);

  4. customer characteristics, such as age, gender, and marital status;

  5. properties of the employment that triggered the unemployment benefits, such as the sector, the type of contract, the working pattern, and the reason for the dismissal;

  6. the presence / absence of a reclamation at the end of σ (one Boolean variable).

Figure 4: Example of vectors that are used as instances to train the predictor. These vectors correspond to the excerpt of the event log in Fig. 3.

Since we want to predict running cases and the event log records completed cases, we need to consider prefixes of running process instances. Namely, if a trace σ is composed of n events, we build n prefixes σ1, …, σn, where σi consists of the first i events of σ. These prefixes are treated as running cases, with the notable difference that the eventual, actual outcome is known. In this way, the prefixes are a suitable input for training the predictor. Figure 4 shows the set of training vectors that are generated for the two traces depicted in Fig. 3. Each prefix σi is encoded as mentioned above, which includes the Boolean variable about the presence / absence of a reclamation at the end of the whole trace σ, called Indication of reclamation. This variable is used as the dependent variable, whereas the others are used as independent variables to correlate with it. As an example, the first instance refers to the execution for the first customer when only the first activity, Initialize the Income Form, is considered; the second refers to the same customer when the first and second activities are accounted for.
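The prefix construction and encoding just described can be sketched as follows; the activity names mirror Fig. 3, while the attribute values (entitlement of 24 months, no reclamation) are hypothetical:

```python
# Sketch: a completed trace of n events yields n prefixes, each encoded as a
# feature vector and labelled with the trace's final reclamation outcome.
from collections import Counter

ACTIVITIES = ["Initialize the Income Form", "Send Payment", "Handle Reclamation"]

def encode_prefix(prefix, max_months, duration_months, had_reclamation):
    """Encode one prefix: activity counts, case attributes, outcome label."""
    counts = Counter(prefix)
    vector = [counts[a] for a in ACTIVITIES]   # one numeric variable per activity
    vector += [max_months, duration_months]    # entitlement and elapsed months
    vector.append(int(had_reclamation))        # dependent (outcome) variable
    return vector

trace = ["Initialize the Income Form", "Send Payment", "Send Payment"]
prefixes = [trace[:i] for i in range(1, len(trace) + 1)]  # n prefixes for n events
vectors = [encode_prefix(p, 24, len(p), False) for p in prefixes]
```

In a full encoding, the customer and employment attributes of items 4 and 5 would be appended to the same vector; they are omitted here for brevity.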

Teinemaa et al. illustrate that several data-mining predictors can be used to predict the dependent variable that encodes the KPI outcome [18], ranging from Decision Tree, Random Forest, and Support Vector Machine to Generalized Boosted Regression Models, Logistic Regression, and ADA Boost. For our experiments, we opted for Logistic Regression and ADA Boost because they provide a prediction model that allows one to analyze which of the vector’s components most heavily affect the prediction (e.g., the beta values of Logistic Regression). As discussed in Section 3, the frequency with which activities are executed for each customer is in the order of once a month. Therefore, it is not worthwhile to predict and recommend more than once a month, and only prefixes referring to entire months are retained; in other words, we train the predictor on a prefix σi (the first i events of a trace) only if the (i+1)-th event belongs to the month that follows that of the i-th event. For example, looking at Fig. 4 for the first customer, we train on the prefixes of length 1, 4, and 7, because these represent the last prefixes before the next month starts.

The techniques based on Logistic Regression and ADA Boost were tuned through hyper-parameter optimization [2]. To this end, the UWV event log was split into a training set with 80% of the traces and a test set with 20% of the traces. The models were learned through a 5-fold cross validation on the training set, using different configurations of the algorithms’ parameters. The models trained with the different parameter configurations were tested on the held-out set with 20% of the traces and ranked using the area under the ROC curve (shortened as AUC) [7]. AUC was chosen because it is the most suitable criterion in the case of unbalanced classes: the customers with a reclamation constitute only around 4% of the total. When performing hyper-parameter optimization, we also tested two alternatives, as advised by Teinemaa et al. [17, 18]. The first alternative is to train a single predictor with the vectors of all prefixes referring to whole months (see the discussion above about the prefixes retained). The second alternative is to cluster the vectors according to the length in months of the corresponding prefixes, and to assign each cluster to a different predictor: one predictor is trained with the vectors of prefixes spanning one month, one with those spanning two months, etc. The outcome of the hyper-parameter optimization was that the second alternative generally led to higher AUC values in combination with the ADA Boost technique.
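The second alternative, one tuned classifier per prefix length, can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not UWV's actual pipeline; the features, labels, and parameter grid are invented:

```python
# Sketch: one AdaBoost classifier per prefix length (in months), each tuned by
# 5-fold cross-validated grid search ranked on AUC (synthetic data throughout).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))                       # synthetic feature vectors
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 1.2).astype(int)  # rare outcome
lengths = rng.integers(1, 4, size=400)              # prefix length in months (1-3)

models = {}
for m in np.unique(lengths):
    X_m, y_m = X[lengths == m], y[lengths == m]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_m, y_m, test_size=0.2, random_state=0, stratify=y_m)  # 80/20 split
    search = GridSearchCV(AdaBoostClassifier(random_state=0),
                          {"n_estimators": [25, 50]},
                          cv=5, scoring="roc_auc")  # AUC copes with imbalance
    search.fit(X_tr, y_tr)
    models[int(m)] = search.best_estimator_         # one tuned model per length
```

At prediction time, a running case is routed to the model matching its current prefix length in months.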

4.2 Collecting and Selecting the Interventions

After three brainstorming sessions with 15 employees and 2 team managers of UWV, the choice of the intervention was made by the stakeholders. As mentioned earlier, this choice was based on the experience and expectations of the stakeholders. The sessions initially put forward three potential types of interventions, defined by the actor involved in the intervention (the customer, the UWV employee, or the last employer):

  1. the customer is supported in advance on how to fill the income form;

  2. the UWV employee verifies the correctness of the information provided by the customer in the income form, and, if necessary, corrects it after contacting the customer;

  3. the last employer of the UWV customer is asked to supply relevant information more quickly, so that the truthfulness of the information provided by the customer in the income form can be promptly verified.

An intervention can only be executed once a month, namely between the income forms for two consecutive months. In the final brainstorming session, the stakeholders opted for option 1 in the list above, i.e. supporting the customer in correctly filling in the income form. The stakeholders stated that, in their experience, support with filling in the form reduces the customers’ chance of incurring reclamations. As mentioned earlier, only one specific intervention was selected for the experiment, due to the limited availability of resources at UWV.

The selected intervention entails pro-actively informing the customer about specific topics regarding the income form, which are frequently filled in incorrectly. These topics relate to the definition of social security wages, financial unemployment and receiving 4-weekly payments instead of monthly payments. The UWV employees indicated that they found that most mistakes were made regarding these topics.

Next to deciding on the action, the medium through which the customer would be informed had to be determined. The options were a physical letter, an email, or a phone call by a UWV employee. In the spirit of keeping costs low, it was decided to send the support information by email. An editorial employee of UWV designed the exact phrasing. The email contained hyperlinks to pages of the UWV website, allowing customers to obtain more insight into the support information provided in the email itself. The tool used by UWV to send emails to large numbers of customers at once provides functionality to check whether an email was received by the recipient, namely without a bounce, as well as whether it was opened in the customer’s email client application. Since the timing of the message can influence the success of the action, it was decided to send it on the day preceding the last working day of the calendar month in which the prediction module marked the customer as risky. This ensured that the message could be read by the customer before filling in the income form for the subsequent month.

5 The Design, Execution, and Analysis of the Field Experiment

Section 5.1 describes how we designed the field experiment and how and when we executed it, i.e. Step 4 in Fig. 1. Section 5.2 presents the results of the experiment to verify the effectiveness and efficacy of the prediction system and the selected intervention, i.e. Step 5 in Fig. 1. Finally, the prediction accuracy is discussed in Section 5.3.

5.1 Design and Execution

The experiment aims to determine whether or not the use of the PAR system, as designed in terms of prediction and intervention, would reduce the number of reclamations. Specifically, we first determined the number and the nature of the customers who were monitored. The involved customers were then split into two groups: to one group the PAR system was applied (the experimental group), while the second group was handled without the PAR system (the control group).

We conducted the experiment with 86,850 cases, which were handled by the Amsterdam branch of UWV. These were customers currently receiving benefits, and they differ from the 73,153 cases that were used to train the predictor module. Out of the 86,850 cases, 35,812 were part of the experimental group. The experiment ran from August 2017 until October 2017; on 30 August, 28 September, and 30 October 2017 the intervention of sending an email was executed. The predictor was used to compute the probability of having a reclamation for the 35,812 cases of the experimental group. The probability was higher than 0.8 for 6,747 cases, and the intervention was executed for those cases.
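The case-selection step can be sketched as follows; the function name, case ids, and probabilities are hypothetical, but the 0.8 threshold matches the one used in the experiment:

```python
# Sketch: the predictor yields a reclamation probability per case; the email
# intervention is executed only for cases above the risk threshold.
def select_for_intervention(case_probabilities, threshold=0.8):
    """Return the ids of the cases whose predicted risk exceeds the threshold."""
    return [case_id for case_id, p in case_probabilities.items() if p > threshold]

risks = {"case-1": 0.93, "case-2": 0.42, "case-3": 0.81, "case-4": 0.80}
risky = select_for_intervention(risks)   # case-4 sits exactly on the threshold
```

Note the strict inequality: a case with probability exactly 0.8 is not selected; whether UWV's system used a strict or non-strict cut-off is not stated in the text.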

5.2 The Intervention Did Not Have a Preventive Effect

Figure 5: The number of cases and percentage of cases having a reclamation for all groups. The results show that risky customers are identified, but the intervention does not really help.

Figure 5 shows the results of the field experiment, where the black triangles indicate the percentage of reclamations observed in each group. The triangles at the left-most stacked bar show that the number of reclamations did not significantly decrease when the system was used: from 4.0% without the system to 3.8% with the system. The effect of the system as a whole is therefore 0.2 percentage points, which is not statistically significant.
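The claim of non-significance can be illustrated with a two-proportion z-test. The group sizes follow from Section 5.1 (35,812 experimental cases out of 86,850, leaving 51,038 control cases); the absolute reclamation counts below are approximations reconstructed from the reported 4.0% and 3.8% rates, not figures from the paper:

```python
# Sketch: two-proportion z-test for H0 "the reclamation rates are equal",
# using approximate counts reconstructed from the reported percentages.
from math import erf, sqrt

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) under the pooled-variance normal approximation."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Control: ~4.0% of 51,038 cases; experimental: ~3.8% of 35,812 cases.
z, p = two_proportion_z(2042, 51038, 1361, 35812)
```

With these approximate counts the p-value lands well above 0.05, consistent with the conclusion that the 0.2-point drop is not significant.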

The second bar from the left shows how the PAR system was used for the customers: 6,747 cases were deemed risky and were e-mailed. Out of these 6,747 cases, 4,065 opened the email with the links to further information. As mentioned in Section 4.2, the tool that UWV uses for sending bulk email can detect whether an email was received (i.e., there was no bounce) and whether it was opened. Since there were almost no bounces, the remaining 2,682 cases did receive the email but did not open it in their email client. Of the customers who opened the email, only 294 actually clicked on the links and accessed the UWV web site. Remarkably, among the customers who clicked the links, 10.9% had a reclamation in the subsequent month: more than 2.5 times the average, and around 1.7 times the frequency among the customers who opened the email but did not click the links.

We conducted a comparative analysis among the customers who did not receive the email, those who received it but did not click the links and, finally, those who reached the web site. The results of the comparative analysis are shown in Figure 6. The results indicate that 76.5% of the customers who clicked the email’s links had an income next to the benefits. Recall that it is possible to receive benefits even when one is employed: this is the situation when the income is reduced and the customer receives benefits for the difference. It is a reasonable result: mistakes are more frequent when filling the income form is more complex (e.g., when there is some income, indeed). Additional distinguishing features of the customers who clicked on the email’s link are that 50.3% of these customers have had a previous reclamation, as well as that these customers are on average 3.5 years older.

Figure 6: Comparison of the characteristics of the customers who did not receive the email, those who received it but did not click the link and who accessed the UWV’s web site through the email’s link.

The results even seem to suggest that emailing was counterproductive or, at least, that there was a positive correlation between exploring the additional information provided and being involved in a reclamation in the subsequent month. To a smaller extent, a higher-than-average frequency of reclamations is also observed among the customers who received the email but did not click the links: 6.2% versus a mean of 3.8–4%. A discussion of the possible reasons for these results can be found in Section 6. However, it is clear that the intervention did not achieve the intended goal.

5.3 The Risk Was Predicted Reasonably Accurately

As already mentioned in Section 1 and Section 5.2, the analysis shows that the experiment did not lead to an improvement. To understand the cause, we analyzed whether this was due to inaccurate predictions, an ineffective intervention, or both. In this section, we analyze the actual quality of the predictor module. We use the so-called cumulative lift curve [11] to assess the prediction model. This measure is chosen because of the imbalance in the data, as advised in [11]. As mentioned in Section 3, only 4% of the customers are eventually involved in reclamations. For unbalanced data sets (such as customers with versus without reclamations), precision and recall are less suitable to assess the quality of predictors. Furthermore, because of the low cost of the intervention of sending an email, the presence of false negatives, i.e. customers with undetected reclamations in the subsequent month, is much more severe than that of false positives, i.e. customers who are wrongly predicted to have a reclamation in the subsequent month.

Figure 7: The cumulative lift curve shows that using the recommender system leads to a better selection of cases than using a random selection of cases.

Figure 7 shows the curve for the case study at UWV. The rationale is that, within a randomly selected set containing a given percentage of the customers, one expects to observe the same percentage of the total number of reclamations. This trend is shown as a dotted line in Fig. 7. In our case, the predictions are better than random: for example, the 10% of customers with the highest risk of having a reclamation accounted for 19% of all reclamations, roughly twice what would be expected in a random sample.
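The cumulative-lift computation behind Fig. 7 can be sketched as follows; the risk scores and labels below are synthetic, not the UWV data, and merely illustrate how an informative predictor captures more than its proportional share of positives.

```python
import numpy as np

def cumulative_lift(scores, labels, fraction):
    """Share of all positives captured in the top `fraction` of cases
    when cases are ranked by predicted risk, descending."""
    order = np.argsort(-np.asarray(scores))          # highest risk first
    top_k = int(len(scores) * fraction)
    captured = np.asarray(labels)[order][:top_k].sum()
    return captured / np.asarray(labels).sum()

# Synthetic example: 1000 customers, roughly 4% positive rate,
# with risk scores that are informative but imperfect.
rng = np.random.default_rng(0)
labels = (rng.random(1000) < 0.04).astype(int)
scores = labels * 0.5 + rng.random(1000)             # positives tend to score higher

lift_at_10 = cumulative_lift(scores, labels, 0.10)
print(f"Top 10% of cases captures {lift_at_10:.0%} of reclamations")
```

A random ranking would capture about 10% of the reclamations in the top 10% (the dotted line in Fig. 7); a value well above that indicates predictive value, as observed for the UWV predictor.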

In summary, while the prediction technique can certainly be improved, considerable prediction effectiveness can be observed (cf. Section 4.1). However, as mentioned in Section 5.2, the system as a whole did not bring a significant improvement. This leads us to conclude that the lack of a significant effect was most likely caused by the ineffectiveness of the intervention. We discuss this in more detail in Section 6.

6 Discussion and Lessons Learned

The experiment proved to be unsuccessful. On the positive side, the predictions were reasonably accurate. However, the intervention of sending an email to high-risk customers did not lead to a reduction in the number of reclamations. There was even a group of customers with twice as many reclamations as the average population. Section 6.1 elaborates on the reasons why the intervention did not work; Section 6.2 focuses on the lessons learned, delineating how the research methodology needs to be updated.

6.1 Why Did the Intervention Not Work?

One of the reasons why the intervention was not successful might be the wrong timing of the email: a different moment within the month could have been more appropriate. However, this does not explain why only 294 of the 6,747 selected customers acted on the email by clicking its links.

Alternative reasons may be that customers found the email message unclear or that the links in the email body pointed to confusing information on the UWV website. Among the 294 customers who clicked the links, and thus took notice of this information, reclamations actually occurred 2.5 times as often.

The communication channel could also be part of the cause: sending the message by letter, or actively calling the customer, might have worked better. In fact, when discussing the reasons for the failure of the experiment, we heard from several stakeholders that they had not expected it because “after speaking to a customer about how to fill in the income form, almost no mistakes are made by that customer” (quoted from a stakeholder). This illustrates how far subjective feelings can be from objective facts.

6.2 What Should be Done Differently Next Time?

We certainly learned that A/B testing is beneficial for assessing the effectiveness of interventions, and that the involvement of stakeholders and other process participants, including, e.g., the UWV’s customers, helps towards achieving the goal. However, the experiment did not achieve the expected results. We learned a number of lessons that we will put in place to adjust the next round of experiments:

  1. Creating a predictor module requires the selection of independent features as inputs to build the predictive model. Reflecting on and analyzing the reasons why an intervention failed can yield interesting insights into new features that should be incorporated when training the predictor. For instance, the features presented in Fig. 6 can be used to train a better predictor for the UWV case: e.g., a Boolean feature indicating whether a customer has income in addition to the benefits.

  2. The insights discussed in the previous point can also be useful for putting forward potential interventions. For instance, an intervention could be to manually check the income form of any customer who had a reclamation in the previous month. This example is derived from the feature representing the number of executions of Detect Reclamation, as discussed in Section 4.1.

  3. Before interventions are selected for the A/B test (Step 3 in Fig. 1), they need to be pre-assessed. The intervention used in our experiment consisted of providing customers with information on specific topics related to filling in the income form. Before running the experiment, we could have already checked on the historical event data whether reclamations were on average fewer when information and support for filling in the income form were provided. Had this check been performed, we could have avoided running an experiment destined to fail.

  4. Since a control group was compared end-to-end with a group on which the system was employed, it is impossible to state the reason for the failure of the intervention, beyond merely observing it. For instance, we should have used questionnaires to assess the reasons for the failure: the customers who received the email should have been asked why they did not click on the links or why, even after clicking, they still made mistakes. Clearly, questionnaires are not applicable to every intervention; other methods have to be envisaged to acquire the information needed to analyze the ineffectiveness of an intervention.

  5. It is unlikely that a single pass through the methodology in Section 2 would have provided satisfactory results: the methodology needs to be iterated in multiple cycles. In fact, this finding is consistent with the principles of Action Research, which is based on the idea of continuous improvement cycles of interaction among process stakeholders, participants, and researchers [6, 16].

  6. The point above highlights the importance of having interaction cycles. However, one cycle took a few months to carry out. This is certainly inefficient: the whole cycle needs to be repeated at high speed, and multiple interventions need to be tested in each cycle. Furthermore, if an intervention is clearly ineffective, the corresponding test needs to be stopped without waiting for the cycle to end.
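As a minimal sketch of the pre-assessment advocated in lesson 3, one could compare historical reclamation rates with and without the kind of support the intervention would provide. The event-data columns below are hypothetical, not taken from the UWV logs; on real data such a comparison is observational and subject to confounding, so it serves to deselect clearly hopeless interventions rather than to prove effectiveness.

```python
import pandas as pd

# Hypothetical historical data: one row per customer-month, with a flag
# for whether the customer received support filling in the income form
# and whether a reclamation followed.
history = pd.DataFrame({
    "received_support": [1, 1, 1, 0, 0, 0, 0, 0],
    "reclamation":      [0, 0, 1, 0, 1, 1, 0, 1],
})

# Compare reclamation rates with and without support. If the supported
# group does not show fewer reclamations, the candidate intervention can
# be deselected before the A/B test.
rates = history.groupby("received_support")["reclamation"].mean()
promising = rates.loc[1] < rates.loc[0]
print(rates)
print("promising:", promising)
```

A more careful pre-assessment would additionally control for the risk profile of the customers who received support, since support is unlikely to be assigned at random in the historical data.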

Figure 8: Overview of the steps that make up the updated research method. These steps correspond to one improvement cycle and are repeated in every cycle. The “I” is used as an abbreviation for “Intervention”. The components that changed relative to Fig. 1 have red dashed lines.

In light of this, the methodology introduced in Section 2 needs to be adjusted; the resulting new methodology is shown in Figure 8. The changes relative to Fig. 1 are shown with red dashed lines. To show how the lessons learned have impacted the original methodology, the numbered items of the previous list are mapped onto Fig. 8 as numbers within red circles.

The impact of adapting the research method according to the lessons learned is not limited to the identified components. For example, the second lesson affects the collection of interventions: generating interventions in a data-driven manner is added to the stakeholder-driven approach. The third lesson adds a new pre-assessment step (Step 2c), whose result is the deselection of interventions collected in Step 2b. The fourth lesson introduces Step 3b, in which the information needed to understand the (in)effectiveness of an intervention is defined. Defining this information affects the design of the A/B test and the analysis of the results in Step 5, for example when questionnaires need to be deployed.

Lessons 5 and 6, i.e. repeating the cycle, speeding it up, and using multiple interventions, are not linked to one specific step: they affect the whole approach. Since the updated approach is more elaborate than the original one, it will require more effort to execute one cycle of the method, let alone multiple cycles with multiple interventions at high speed. Systematic support needs to be developed for all steps of the updated research methodology to allow for a smooth execution.

7 Conclusion

When building a Process-aware Recommender system, both the predictor and the recommender parts must be effective for the whole system to be effective. In our case, the predictor module was accurate enough, but the intervention did not have the desired effect. The lessons learned from the field experiment have been translated into an updated research method, which asks for high-speed iterations with multiple interventions. Systematic support will be needed for each step of the approach to meet these requirements.

As future work, we plan to improve the predictor module to achieve better predictions by using different techniques and leveraging contextual information about customers and their history. As an example, our analysis showed that the presence of some monetary income in addition to the benefits is strongly associated with reclamations. In the spirit of action research [6, 16], we also plan further sessions with the UWV employees to gather advice on new features for the predictor module and new potential interventions. After a few of those sessions, we will design and execute a new field experiment to find an effective intervention against reclamations. As described, we want to use evidence from the process executions, and insights from building the predictor module, to select one or more interventions to be tested in the new experiment.

Orthogonally to a new field experiment, we aim to devise a new technique that adaptively finds the best intervention for each specific case. Different cases might require different interventions, and the choice of the best intervention should be automatically derived from the historical facts recorded in the system’s event logs. In other words, the system should feature machine-learning techniques that (1) reason on past executions to find the interventions that have generally been most effective for similar cases, and (2) recommend accordingly.
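As a minimal sketch of such an adaptive recommender, assuming a hypothetical history of (case features, intervention, outcome) triples, one could recommend the intervention with the highest historical success rate among matching past cases. A production system would need a proper similarity measure over cases and corrections for confounding in the historical assignments; this toy version only illustrates the idea.

```python
from collections import defaultdict

# Hypothetical history of past cases: (case features, intervention, success).
history = [
    ({"prev_reclamation": 1}, "phone_call", True),
    ({"prev_reclamation": 1}, "email", False),
    ({"prev_reclamation": 1}, "phone_call", True),
    ({"prev_reclamation": 0}, "email", True),
    ({"prev_reclamation": 0}, "email", True),
    ({"prev_reclamation": 0}, "phone_call", False),
]

def best_intervention(case):
    """Recommend the intervention with the highest historical success
    rate among past cases whose features match this case."""
    stats = defaultdict(lambda: [0, 0])  # intervention -> [successes, trials]
    for features, intervention, success in history:
        if features == case:
            stats[intervention][0] += int(success)
            stats[intervention][1] += 1
    return max(stats, key=lambda i: stats[i][0] / stats[i][1])

print(best_intervention({"prev_reclamation": 1}))  # prints phone_call
```

Note how the recommendation differs per case: customers with a previous reclamation get a different intervention than those without, which is exactly the case-specific behavior envisaged above.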


  • [1] van der Aalst, W.M.P.: Process Mining - Data Science in Action, 2nd Edition. Springer (2016)
  • [2] Claesen, M., Moor, B.D.: Hyperparameter search in machine learning. CoRR abs/1502.02127 (2015)
  • [3] Conforti, R., ter Hofstede, A.H.M., La Rosa, M., Adams, M.: Automated risk mitigation in business processes. In: Meersman, R., Panetto, H., Dillon, T., Rinderle-Ma, S., Dadam, P., Zhou, X., Pearson, S., Ferscha, A., Bergamaschi, S., Cruz, I.F. (eds.) On the Move to Meaningful Internet Systems: OTM 2012. pp. 212–231. Springer Berlin Heidelberg (2012)
  • [4] Conforti, R., de Leoni, M., Rosa, M.L., van der Aalst, W.M., ter Hofstede, A.H.: A recommendation system for predicting risks across multiple business process instances. Decision Support Systems 69, 1 – 19 (2015).
  • [5] Conforti, R., Rosa, M.L., Fortino, G., ter Hofstede, A.H., Recker, J., Adams, M.: Real-time risk monitoring in business processes: A sensor-based approach. Journal of Systems and Software 86(11), 2939 – 2965 (2013).
  • [6] Cronholm, S., Goldkuhl, G.: Understanding the practices of action research. In: The 2nd European Conference on Research Methods in Business and Management (ECRM 2003), Reading, UK, 20–21 March 2003 (2003)
  • [7] Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (Jun 2006)
  • [8] Fazzinga, B., Flesca, S., Furfaro, F., Pontieri, L.: Online and offline classification of traces of event logs on the basis of security risks. Journal of Intelligent Information Systems 50(1), 195–230 (02 2018).
  • [9] Kohavi, R., Longbotham, R.: Online Controlled Experiments and A/B Testing, pp. 922–929. Springer US, Boston, MA (2017)
  • [10] Lhannaoui, H., Kabbaj, M.I., Bakkoury, Z.: Towards an approach to improve business process models using risk management techniques. In: 2013 8th International Conference on Intelligent Systems: Theories and Applications (SITA). pp. 1–8 (05 2013)
  • [11] Ling, C.X., Li, C.: Data mining for direct marketing: Problems and solutions. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York City, New York, USA, August 27-31, 1998. pp. 73–79 (1998)
  • [12] Metzger, A., Föcker, F.: Predictive business process monitoring considering reliability estimates. In: Dubois, E., Pohl, K. (eds.) Advanced Information Systems Engineering. pp. 445–460. Springer International Publishing, Cham (2017)
  • [13] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
  • [14] Pika, A., van der Aalst, W., Wynn, M., Fidge, C., ter Hofstede, A.: Evaluating and predicting overall process risk using event logs. Information Sciences 352-353, 98 – 120 (2016)
  • [15] Plebani, P., Marrella, A., Mecella, M., Mizmizi, M., Pernici, B.: Multi-party business process resilience by-design: A data-centric perspective. In: Dubois, E., Pohl, K. (eds.) Advanced Information Systems Engineering. pp. 110–124. Springer International Publishing (2017)
  • [16] Rowell, L.L., Riel, M.M., Polush, E.Y.: Defining Action Research: On Dialogic Spaces for Constructing Shared Meanings, pp. 85–101. Palgrave Macmillan US, New York (2017)
  • [17] Teinemaa, I., Dumas, M., Maggi, F.M., Francescomarino, C.D.: Predictive business process monitoring with structured and unstructured data. In: Business Process Management - 14th International Conference, BPM 2016. Proceedings. pp. 401–417 (2016)
  • [18] Teinemaa, I., Dumas, M., Rosa, M.L., Maggi, F.M.: Outcome-oriented predictive process monitoring: Review and benchmark. CoRR abs/1707.06766 (2017)