A Novel Approach to Detect Redundant Activity Labels For More Representative Event Logs

by   Qifan Chen, et al.
The University of Sydney

The insights revealed by process mining rely heavily on the quality of event logs. Activities extracted from healthcare information systems that allow free-text entry may carry inconsistent labels. Such inconsistency leads to redundant activity labels, i.e. labels that have different syntax but share the same behaviour. Identifying these labels through data-driven process discovery is difficult and relies heavily on resource-intensive human review. Existing work achieves low accuracy when redundant activity labels occur infrequently or when event logs contain numerical data values as attributes. However, both phenomena are commonly observed in healthcare information systems. In this paper, we propose an approach to detect redundant activity labels using control-flow relations and numerical data values from event logs. Natural Language Processing is also integrated into our method to assess semantic similarity between labels, which provides users with additional insights. We have evaluated our approach through synthetic logs generated from the real-life Sepsis log and a case study using the MIMIC-III data set. The results demonstrate that our approach can successfully detect redundant activity labels. This approach can add value to the preprocessing step to generate more representative event logs for process mining tasks in the healthcare domain.




1 Introduction

Process mining (PM) is a technology known to be useful for understanding business processes using event logs captured in information systems [29]. It has shown promising potential in many aspects, including discovering significant insights and improving process performance. A typical event log is a collection of events, each with a timestamp that records when it was executed. An event represents a unique execution of an activity, which is a well-defined step in the process, such as “doctor appointment”. Events are grouped into cases, also called process instances. For example, a case could be a patient who follows a treatment process in a hospital.

In PM, data quality is a critical issue for generating useful process models. The quality of activity labels in event logs is a concern specific to PM research, as it can be affected by the integration of data sources or by labelling discrepancies within the same information system. A particular issue is redundant activity labels, which are referred to as synonymous and polluted labels in [24]. Redundant labels do not need to have the same semantics, as long as they represent the same activity in reality. This type of redundancy introduces unnecessary complication to event logs and complexity to discovered models. One contributing factor is data integration from separate systems (e.g. EMRs (Electronic Medical Records) from different hospitals), because multiple systems use different names for the same concept (e.g. “Take Temperature” and “Temperature C” in figure 1). The other is mainly free-text input or human error within the same system [24] (e.g. “BloodPressure” and “blood pressure” in figure 1).

Although efforts have been made to address redundant activity labels [2, 13, 14, 19, 23, 24], many of these approaches have difficulty identifying activity labels with low occurrence frequencies or cases where invalid labels have been used. Many of them also rely on event logs with resources and various kinds of attributes [2, 24] or on domain knowledge [19, 23] to improve data quality. If those additional resources or knowledge are not available, a method has to rely solely on control-flow relations and the data values available in event logs to generate better process models.

The aim of this paper is to propose a novel approach to detect both frequent and infrequent redundant activity labels by efficiently incorporating the control-flow relations and data values in event logs. For the control-flow perspective (i.e. the ordering of activities), we adopt a statistical method, Earth Mover’s Distance (EMD), to compare the directly-follows and indirectly-follows relations of different activity labels. For the data value perspective (i.e. the numerical values recorded for activities), activity labels are first clustered with Agglomerative Hierarchical Clustering, and EMD is then used to compare the distributions of their data. A decision-making mechanism integrates the results generated from the multiple perspectives into a consensus. We evaluate our approach using three publicly available event logs. Two logs come from different Dutch hospitals, and a case study using the MIMIC-III data set [11] has been conducted to further demonstrate our approach’s usability in real-life situations.

The paper is structured as follows. Section 2 discusses the background. Section 3 introduces the basic concepts used throughout the paper. In Sect. 4, we explain the main approaches for two different perspectives and the method to combine results. In Sect. 5, we describe the evaluations using two real-life event logs and the comparisons with the existing approach. A real-life case study is explored in Sect. 6. The paper concludes with Sect. 7.

Figure 1: A motivation example (left) and the output model after our approach (right)

2 Background

Event log quality has been identified as a critical issue that affects PM results in both process discovery and improvement [29]. The PM Manifesto [30] has emphasized the importance of event log quality: its first guideline for PM is to treat event data as first-class citizens. Later on, [27] proposes 11 event log imperfection patterns, including incorrect timestamps and polluted labels. [8, 18] suggest quality frameworks for assessing EMR data in healthcare. Also, [4, 28] raise concerns about event data quality in PM. Therefore, it is useful to address data quality as early as the event log level.

Turning to the detection of redundant activity labels, [23, 24] suggest two ways to deal with this issue at the event log level and are the most relevant works. [24] proposes a contextual approach that takes control-flow relations, resources, time and data attributes into consideration. For the control-flow perspective, it reports the similarity between rows of the footprint matrix, which may not distinguish the frequency difference between two activity labels well and suffers from noisy or infrequent relations. The remaining perspectives largely adopt the probability density function (PDF) to assess the value distributions between activity labels. It relies on a weighted clustering method to combine the final results, which requires domain knowledge to determine the best weight settings. The other method [23] collaboratively and interactively detects problematic activity labels using a gamified crowdsourcing approach, which uses gamification elements (e.g. badges) to encourage a large group of domain experts to identify and repair redundant activity labels.

Another line of work is process matching at the model level [2, 7, 13, 21]. These approaches match two process models from different data sources with the aim of finding similar structures and activity labels. It is hard for them to deal with redundant labels within the same log, since the separated logs may contain incomplete processes. Hence, they are more widely used for process similarity comparison than for repairing problematic logs. Other approaches such as [14, 19] look directly at the activity labels themselves while ignoring other information in the logs, which may cause unwanted results. For instance, “Release A” and “Release C” would be treated as redundant in the hospital log since they have a small string edit distance. However, these two labels represent different ways to discharge patients [17].

3 Preliminaries

Before proposing our novel ideas, this section introduces basic concepts needed in Sect. 4.

Definition 1 (Event log)

An event log is defined as $L = (E, A, V, N, \#, T)$, where: $E$ is the set of unique event identifiers; $A$ is the set of activities; $V$ is the set of data values; $N$ is the set of numerical attribute names; $\#$ is a function on $E$ that obtains the data values recorded for an event $e \in E$. For example, $\#_{act}(e)$ gets the activity name for an event, and $\#_{n}(e)$ with $n \in N$ gets the numerical data value for an event. $T \subseteq E^{*}$ is the set of traces over $E$. A trace $\sigma \in T$ records the sequence of events of a process instance. Each event occurs only once in a single trace.

Definition 2 (Basic Relations)
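  1. Directly-Follows Relation: $a >_L b$ holds iff there is a trace $\sigma = \langle e_1, \dots, e_n \rangle \in T$ and an index $i \in \{1, \dots, n-1\}$ such that $\#_{act}(e_i) = a$ and $\#_{act}(e_{i+1}) = b$, where $\#_{act}(e)$ denotes the activity name of event $e$.

  2. Indirectly-Follows Relation: $a \gg_L b$ holds iff there is a trace $\sigma = \langle e_1, \dots, e_n \rangle \in T$ and indices $i, j$ with $j > i + 1$ such that $\#_{act}(e_i) = a$ and $\#_{act}(e_j) = b$. In addition, the relation needs to pass a certain threshold under the long-distance dependency measure in [32].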

Definition 3 (Directly-Follows Graph)

A directly-follows graph is defined as $G = (A, R)$, where $A$ is the finite set of activities in the event log (same as Definition 1) and $R \subseteq A \times A$ is a set of directed arcs that represent directly-follows relations (i.e. $(a, b) \in R$ exists if $a >_L b$). An example is shown in figure 2. For any $a \in A$, $Out(a) = \{b \in A \mid (a, b) \in R\}$ represents all the directly outgoing activities from $a$. Similarly, $In(a) = \{b \in A \mid (b, a) \in R\}$ represents all the directly incoming activities to $a$. $|a >_L b|$ counts how many times the relation $a >_L b$ occurs in $L$.

Figure 2: An example directly-follows graph
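The directly-follows graph of Definition 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each trace is already given as a list of activity labels, and the function names and sample traces are hypothetical.

```python
from collections import Counter, defaultdict


def directly_follows_counts(traces):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def neighbour_sets(counts):
    """Return the outgoing and incoming activity sets of every activity."""
    outgoing, incoming = defaultdict(set), defaultdict(set)
    for a, b in counts:
        outgoing[a].add(b)
        incoming[b].add(a)
    return outgoing, incoming


# Hypothetical traces, each given as a list of activity labels.
traces = [
    ["Register", "Take Temperature", "BloodPressure", "Discharge"],
    ["Register", "BloodPressure", "Discharge"],
    ["Register", "blood pressure", "Discharge"],
]
counts = directly_follows_counts(traces)
outgoing, incoming = neighbour_sets(counts)
print(counts[("BloodPressure", "Discharge")])  # 2
print(sorted(incoming["Discharge"]))
```

The arc counts play the role of the relation frequencies in Definition 3, and the two dictionaries hold the directly outgoing and incoming activity sets.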

4 Methods

This section describes our approach, shown in figure 3, to detect redundant activity labels. The underlying assumption is that redundant activity labels should share the same patterns in both control-flow relations and numerical data values. Our approach therefore assesses similarity in these two aspects using a statistical method, Earth Mover’s Distance (EMD). To this end, we first introduce EMD for comparing the probability distributions of control-flow relations. Then, we demonstrate how to extend EMD to calculate the similarity of numerical data values. Finally, we briefly describe how a decision-making mechanism combines the results from the different perspectives to obtain the final output.

Figure 3: An overview of proposed method

4.1 Earth Mover’s Distance

The earth mover’s distance (EMD) [22] is a method for comparing two multi-dimensional probability distributions over a region. It was first proposed as a metric for image retrieval in the computer vision domain, and has since been applied to many other fields [3, 33]. The EMD is the lowest cost of transforming one distribution into the other, where the two distributions are viewed as different ways of piling up a certain amount of dirt over a region. A ground distance function defines the cost of moving dirt between piles. EMD has two useful properties. First, it is frequency-aware, i.e. it considers the magnitude of the discovered differences. Second, the difference is determined by the ground distance function, which can express different perceptions of similarity [5].

Definition 4 (Earth Mover’s Distance)

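Let $P = \{(p_1, w_{p_1}), \dots, (p_m, w_{p_m})\}$ be a probability distribution with $p_i$ as the different clusters and $w_{p_i}$ as the associated weights of these clusters. Let $Q = \{(q_1, w_{q_1}), \dots, (q_n, w_{q_n})\}$ be another probability distribution with the same notation. A ground distance $d_{ij}$ between clusters $p_i$ and $q_j$ is defined. We would like to find a flow $F = [f_{ij}]$, where $f_{ij}$ is the flow between $p_i$ and $q_j$, that minimizes the overall cost of transferring $P$ to $Q$. The following constraints must hold:

  1. Non-negative flow: $f_{ij} \geq 0$, for $1 \leq i \leq m$, $1 \leq j \leq n$.

  2. Sent and received flow should not exceed the weights in $P$ and $Q$:

    • $\sum_{j=1}^{n} f_{ij} \leq w_{p_i}$, $1 \leq i \leq m$;

    • $\sum_{i=1}^{m} f_{ij} \leq w_{q_j}$, $1 \leq j \leq n$.

  3. All possible weight has to be sent: $\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left(\sum_{i=1}^{m} w_{p_i}, \sum_{j=1}^{n} w_{q_j}\right)$.

Given the optimal flow $F$, the EMD is defined as:

$$EMD(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \, d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \quad (1)$$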
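To make the constraints concrete, the following sketch checks whether a candidate flow is feasible and evaluates its normalised cost. It does not search for the optimal flow; it only illustrates the definition. The function names, the tolerance parameter and the example numbers are assumptions made for illustration.

```python
def is_feasible_flow(flow, w_p, w_q, tol=1e-9):
    """Check the three EMD constraints for a candidate flow matrix."""
    m, n = len(w_p), len(w_q)
    non_negative = all(flow[i][j] >= -tol for i in range(m) for j in range(n))
    sent_ok = all(sum(flow[i]) <= w_p[i] + tol for i in range(m))
    received_ok = all(
        sum(flow[i][j] for i in range(m)) <= w_q[j] + tol for j in range(n)
    )
    total = sum(sum(row) for row in flow)
    all_sent = abs(total - min(sum(w_p), sum(w_q))) <= tol
    return non_negative and sent_ok and received_ok and all_sent


def flow_cost(flow, dist):
    """Total transport cost of a flow, normalised by the total flow."""
    total = sum(sum(row) for row in flow)
    work = sum(
        flow[i][j] * dist[i][j]
        for i in range(len(flow))
        for j in range(len(flow[i]))
    )
    return work / total


# Two clusters in each distribution, unit cost between different clusters.
w_p = [0.5, 0.5]
w_q = [0.25, 0.75]
dist = [[0, 1], [1, 0]]
flow = [[0.25, 0.25], [0.0, 0.5]]  # moves 0.25 of weight from cluster 1 to cluster 2

print(is_feasible_flow(flow, w_p, w_q))  # True
print(flow_cost(flow, dist))             # 0.25
```

In this small example the flow shown is also the cheapest feasible one, so 0.25 is the EMD between the two distributions. In general the optimal flow has to be found with a transportation or linear-programming solver.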


4.2 Addressing Redundancy by Measuring Control-flow Similarity

Principle. Redundant activity labels should share similar control-flow relations or ordering patterns. This kind of similarity means not only identical control-flow relations but also suggests close distribution patterns.

The overall idea behind the control-flow perspective, as shown in figure 3, is that for each pair of activity labels we adopt EMD to compare their directly-follows and indirectly-follows relations along with their distributions. Each directly/indirectly-follows comparison is further divided into directly/indirectly outgoing (i.e. consequence) and incoming (i.e. precedent) relations. Thus, we obtain four different values, and the final similarity is the average of these values.

The control-flow perspective is split into directly-follows and indirectly-follows comparisons. We focus on explaining the directly-follows comparison, since the indirectly-follows comparison works in the same way except that the relations are indirectly-follows. The reason we also consider indirectly-follows relations is to deal with non-free-choice constructs (i.e. whether a task is chosen depends on what has been executed earlier in the process [9]). For instance, two activities in figure 2 have identical directly-follows relations, but an indirectly-follows relation (i.e. the dashed line) also exists. Thus, these two activities should not be regarded as redundant.

Algorithm 1 presents our approach for calculating the directly-follows similarity. The starting point is to construct a directly-follows graph from the log (Line 1). Then, for each activity label, we calculate its outgoing and incoming activity sets (Lines 2-4). Using Equation 2, the weights are calculated for each element in these activity sets (Lines 5-6). Afterwards, for each pair of activity labels, we adopt EMD to calculate the similarity between their incoming and between their outgoing activity sets, using the ground distance function from Equation 3 (Lines 7-9). The activities in the sets are the clusters, and the weights in the sets are the associated weights of each cluster. For instance, to compare the outgoing activity sets of two activities in figure 2, the input signatures for EMD are the two outgoing activity sets together with their weights. Lastly, the directly incoming and outgoing similarities are averaged to obtain the final directly-follows similarity for each pair of activity labels, which is added to the output set (Lines 10-11).

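The weight of a single activity in the incoming/outgoing activity sets is its relative frequency. For an activity $b$ in the outgoing activity set $Out(a)$ of $a$, where $|a >_L b|$ is the number of times $b$ directly follows $a$ in the log, it is defined as:

$$w_{out}(a, b) = \frac{|a >_L b|}{\sum_{c \in Out(a)} |a >_L c|} \quad (2)$$

The weights of the incoming activity set are calculated analogously from $|b >_L a|$.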

The distance function for EMD between any two clusters $a_i$ and $a_j$ from the activity sets is defined as:

$$d(a_i, a_j) = \begin{cases} 0 & \text{if } a_i = a_j \\ 1 & \text{otherwise} \end{cases} \quad (3)$$

Principle. Identical activity labels have no cost, and different ones have unit cost. This cost function can easily be extended based on other metrics, e.g. the global location of activities. Here, we show only the most basic version for better understandability.

The same process applies to the calculation of the indirectly incoming and outgoing similarities: we construct an indirectly-follows graph from the log and obtain a set of indirectly-follows similarities. Then, for each pair of activity labels, the overall control-flow similarity is the average of the directly-follows and indirectly-follows similarities, which is a value between 0 and 1. The greater the value, the more effort is needed to transfer one distribution into the other, which means the two activity labels are less similar with regard to the control-flow perspective. The combination of the four scores can easily be extended with other statistical or clustering algorithms, e.g. a weighted average or k-means clustering. We would like to show that our approach achieves desirable results with the most fundamental method and requires no domain knowledge as input, e.g. weight settings or the number of clusters.

Input: Event log L
Output: S_df: Set of Directly-Follows Similarities For All Pairs of Activities
1 G ← MakeDirectlyFollowsGraph(L);
2 foreach a ∈ A do
3       Out(a) ← OutGoing(a);
4       In(a) ← InComing(a);
5       W_out(a) ← CalculateWeight(Out(a));
6       W_in(a) ← CalculateWeight(In(a));
7 foreach (a_i, a_j) ∈ A × A do
8       Outgoing Similarity = EMD((Out(a_i), W_out(a_i)), (Out(a_j), W_out(a_j)));
9       Incoming Similarity = EMD((In(a_i), W_in(a_i)), (In(a_j), W_in(a_j)));
10       Directly-Follows Similarity = Average(Outgoing Similarity, Incoming Similarity);
11       S_df ← S_df ∪ {Directly-Follows Similarity};
Algorithm 1 Directly-Follows Similarity
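The steps of Algorithm 1 can be sketched as follows. This is an illustrative reading, not the authors' code. It relies on one known simplification: when both weight vectors sum to one and the ground distance is 0 for equal labels and 1 otherwise, the EMD equals half the sum of the absolute weight differences (the total variation distance), so no solver is needed. Treating an activity without any incoming or outgoing relations as maximally distant from one that has them is an assumption of this sketch, as are all function names and the sample traces.

```python
from collections import Counter, defaultdict


def relation_weights(traces):
    """Relative frequencies of outgoing and incoming directly-follows relations."""
    out_counts, in_counts = defaultdict(Counter), defaultdict(Counter)
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            out_counts[a][b] += 1
            in_counts[b][a] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {label: n / total for label, n in counter.items()}

    out_w = {a: normalise(c) for a, c in out_counts.items()}
    in_w = {a: normalise(c) for a, c in in_counts.items()}
    return out_w, in_w


def emd_unit_cost(p, q):
    """EMD under the 0/1 ground distance for weights that sum to one."""
    if not p and not q:
        return 0.0  # assumption: two empty sets are identical
    if not p or not q:
        return 1.0  # assumption: empty versus non-empty is maximally distant
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in labels)


def directly_follows_distance(a, b, out_w, in_w):
    """Average of the outgoing and incoming EMD values for one pair."""
    outgoing = emd_unit_cost(out_w.get(a, {}), out_w.get(b, {}))
    incoming = emd_unit_cost(in_w.get(a, {}), in_w.get(b, {}))
    return (outgoing + incoming) / 2


traces = [
    ["Register", "Take Temperature", "BloodPressure", "Discharge"],
    ["Register", "BloodPressure", "Discharge"],
    ["Register", "blood pressure", "Discharge"],
]
out_w, in_w = relation_weights(traces)
print(directly_follows_distance("BloodPressure", "blood pressure", out_w, in_w))  # 0.25
```

Here the two blood pressure labels have identical outgoing relations (distance 0) and partly overlapping incoming relations (distance 0.5), which averages to 0.25. Smaller values indicate more similar control-flow behaviour.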

4.3 Addressing Redundancy by Measuring Data Value Similarity

This section introduces the approach to calculate similarity from the data value perspective, e.g. the numerical results of medical tests. The overall approach, shown in figure 3, is divided into two stages. Firstly, we group activity labels into clusters based on percentiles. Then, we apply EMD to assess the data distributions of activity labels within each cluster. Clustering first ensures that only activity labels with the same data range are evaluated further. Activities with different data ranges are unlikely to share similar data patterns, so their data distributions do not need to be assessed further.

We describe our approach in Algorithm 2. For each activity, we first assess whether the activity has a data value attribute (Line 2). If not, it has the minimum data value similarity with all other activity labels (i.e. a distance of 1). If so, Line 3 finds all events of the activity and collects the values of the attribute into a data set. Line 4 calculates the 25th and 75th percentiles of each data set. We use the 25th and 75th percentiles as a 2-D vector and apply Agglomerative Hierarchical Clustering [12] with a threshold to all data sets (Line 5). Activity labels that are not in the same cluster also have the minimum similarity. Since the numerical data contains many unique values, it is hard to apply EMD directly, because the distribution would have too many different clusters. We therefore convert each data set into a histogram following Sturges’ formula [26], using shared maximum and minimum values to ensure the two histograms have the same number and size of bins when comparing a pair of activity labels within the same cluster (Line 11). We pick the left boundary of each interval as the clusters, and Lines 12-13 calculate the percentage of values in each bin as the weights. EMD is then used to compare the two histograms using the distance function in Equation 4 (Line 14). The result is normalized [10] to a value between 0 and 1 and added to the output set. As with the control-flow perspective, the greater the value, the less similar the two labels are in the data value perspective.

The distance function for EMD between any two data value clusters $v_i$ and $v_j$ from the histogram is defined as:

$$d(v_i, v_j) = |v_i - v_j| \quad (4)$$

Principle: Since both $v_i$ and $v_j$ are numerical values, it takes less effort to transfer $v_i$ to $v_j$ if they are close to each other. We adopt the absolute difference between $v_i$ and $v_j$ as the ground distance function.
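The histogram comparison can be sketched with the standard library alone. For two histograms over the same equally spaced bins with equal total weight, the EMD under the absolute-difference ground distance equals the bin width times the sum of the absolute differences between the cumulative weights, which is what the code computes. Two choices below are assumptions of this sketch rather than details given in the text: the number of bins is derived from the larger of the two data sets, and the result is normalised by the largest possible ground distance between two bins.

```python
import math


def shared_histograms(x, y):
    """Bin two data sets over a shared range with Sturges' formula."""
    lo, hi = min(min(x), min(y)), max(max(x), max(y))
    bins = math.ceil(math.log2(max(len(x), len(y)))) + 1
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def weights(values):
        counts = [0] * bins
        for v in values:
            index = min(int((v - lo) / width), bins - 1)
            counts[index] += 1
        return [c / len(values) for c in counts]

    return weights(x), weights(y), width


def emd_1d(p, q, width):
    """EMD between two histograms on the same equally spaced bins."""
    cumulative, work = 0.0, 0.0
    for p_i, q_i in zip(p[:-1], q[:-1]):
        cumulative += p_i - q_i
        work += abs(cumulative) * width
    return work


def normalised_histogram_distance(x, y):
    """EMD divided by the largest ground distance, giving a value in [0, 1]."""
    p, q, width = shared_histograms(x, y)
    max_distance = width * (len(p) - 1)
    return emd_1d(p, q, width) / max_distance if max_distance else 0.0


print(normalised_histogram_distance([36.5, 36.8, 37.0, 37.2], [36.5, 36.8, 37.0, 37.2]))  # 0.0
print(normalised_histogram_distance([0, 0, 0, 0], [10, 10, 10, 10]))  # 1.0
```

Identical data sets give a distance of 0, and data sets concentrated at opposite ends of the shared range give 1.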

Input: Event log L, clustering threshold
Output: S_dv: Set of Data Value Similarities For All Pairs of Activities
1 foreach a ∈ A do
2       if HasDataValueAttribute(a) then
3             D(a) ← ExtractData(a);
4             P(a) ← CalculatePercentiles(D(a));
5 Clusters ← AgglomerativeHierarchicalClustering(P, threshold);
6 foreach C ∈ Clusters do
7       if size(C) < 2 then
8             continue;
9       else
10             foreach (a_i, a_j) ∈ C × C do
11                   H(a_i), H(a_j) ← MakeHistograms(D(a_i), D(a_j));
12                   W(a_i) ← CalculateWeight(H(a_i));
13                   W(a_j) ← CalculateWeight(H(a_j));
14                   Data Value Similarity = EMD((H(a_i), W(a_i)), (H(a_j), W(a_j)));
15                   S_dv ← S_dv ∪ {Data Value Similarity};
Algorithm 2 Data Value Similarity
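The percentile and clustering stage of Algorithm 2 can be illustrated with a small standard-library sketch. The text does not state which linkage criterion is used, so single linkage is assumed here; cutting a single-linkage hierarchy at a distance threshold is equivalent to grouping all points connected by chains of pairwise distances no larger than the threshold. The function names, the sample values and the threshold of 20 are hypothetical.

```python
import math
import statistics


def percentile_vector(values):
    """25th and 75th percentiles of a data set as a 2-D vector."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q1, q3)


def threshold_clusters(vectors, threshold):
    """Single-linkage agglomerative clustering cut at a distance threshold."""
    labels = list(vectors)
    parent = {label: label for label in labels}

    def find(label):
        # Follow parent links to the representative of the cluster.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if math.dist(vectors[a], vectors[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for label in labels:
        clusters.setdefault(find(label), set()).add(label)
    return list(clusters.values())


# Hypothetical numerical values recorded for three activity labels.
data = {
    "CRP": [80, 90, 100, 110, 120],
    "crp": [82, 95, 101, 108, 118],
    "Leucocytes": [5, 8, 10, 12, 15],
}
vectors = {label: percentile_vector(values) for label, values in data.items()}
clusters = threshold_clusters(vectors, threshold=20)
print(clusters)
```

With these values the two CRP labels fall into one cluster and "Leucocytes" into its own, so only the CRP pair would proceed to the histogram comparison.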

4.4 Decision-making Mechanism to Aggregate Results

At this point, each pair of activity labels has two similarities, one for control-flow relations and one for numerical data values. This section describes a decision-making mechanism that aggregates the similarities from these two perspectives and generates the final results. The decision-making mechanism is a set of rules that decide how the results are combined [25]. As shown in figure 3, rules can either be produced from thresholds or be provided by domain experts who participate in the decision-making mechanism. For threshold rules, we have one threshold per perspective to decide whether activity labels are similar in that dimension. For instance, activity labels whose similarity values fall below the thresholds in both perspectives are regarded as redundant. Threshold rules can also be easily extended with other features, e.g. frequency. To illustrate, activity labels with low frequency can be regarded as redundant if their value falls below the threshold in either perspective, whereas all other labels need to satisfy both. We avoid asking users to determine weight settings, since different settings can significantly affect the final results and are difficult to determine without domain knowledge.
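The threshold rules can be written down as a small sketch. The function names, the `mode` parameter and the frequency cutoff are hypothetical; the threshold values in the usage lines are the ones reported later in the evaluation (0.25 for control flow and 0.1 for data values).

```python
def is_redundant(cf_distance, dv_distance, cf_threshold, dv_threshold, mode="all"):
    """Apply threshold rules to the two perspective values of one label pair."""
    similar = [cf_distance < cf_threshold, dv_distance < dv_threshold]
    if mode == "all":
        return all(similar)
    if mode == "any":
        return any(similar)
    raise ValueError("mode must be 'all' or 'any'")


def is_redundant_with_frequency(cf_distance, dv_distance, cf_threshold,
                                dv_threshold, frequency, low_frequency_cutoff):
    """Frequency extension: infrequent labels only need one similar perspective."""
    mode = "any" if frequency < low_frequency_cutoff else "all"
    return is_redundant(cf_distance, dv_distance, cf_threshold, dv_threshold, mode)


print(is_redundant(0.20, 0.05, 0.25, 0.1))              # True
print(is_redundant(0.20, 0.30, 0.25, 0.1))              # False
print(is_redundant(0.20, 0.30, 0.25, 0.1, mode="any"))  # True
```

Because the rules are plain comparisons, no weights have to be chosen; only the thresholds and the choice between "all" and "any" need to be set.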

5 Evaluation

We conducted three experiments to evaluate our approach using two publicly available event logs, i.e. the Hospital Billing log (https://data.4tu.nl/articles/dataset/Hospital_Billing_-_Event_Log/12705113) and the Sepsis log (https://data.4tu.nl/articles/dataset/Sepsis_Cases_-_Event_Log/12707639). The approach is implemented as a Python program (https://github.com/GilbertFan/Redundant-Activity-Detection) for the evaluations. Figure 4 illustrates the evaluation process. We follow the idea in [24] to generate experimental logs. Specifically, we randomly select a certain proportion of activity labels and, for each selected label, randomly rename a percentage (e.g. 0.1% to 30%) of its events. For example, one setting selects 20% of the activity labels and renames 0.1% of the events of each selected label. For each setting, 5 logs are generated for more accurate results. Then, we run our approach and the baseline approach [24] to detect redundant activity labels. We report the results in terms of average recall and f-score. Recall is reported because, in practice, we are more interested in finding as many positive instances as possible. To show that our approach performs well in both dimensions, the evaluation is divided into three aspects: control-flow similarity, data value similarity and overall results.
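As an illustration of the reported measures, the sketch below computes precision, recall and f-score over detected pairs of labels. Scoring at the level of unordered label pairs is an assumption of this sketch, and the function name and example pairs are hypothetical.

```python
def detection_scores(detected, truth):
    """Precision, recall and f-score over unordered pairs of activity labels."""
    detected = {frozenset(pair) for pair in detected}
    truth = {frozenset(pair) for pair in truth}
    true_positives = len(detected & truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical ground truth: two labels were renamed in the generated log.
truth = [("CRP", "CRP_1"), ("Leucocytes", "Leucocytes_1")]
detected = [("CRP_1", "CRP"), ("CRP", "LacticAcid")]
print(detection_scores(detected, truth))  # (0.5, 0.5, 0.5)
```

Using frozensets makes the order inside a pair irrelevant, so ("CRP_1", "CRP") counts as a hit for ("CRP", "CRP_1").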

Figure 4: Experiment setup
Log               #Traces  #Trace variants  #Events  Avg events/trace  #Activities  Used in
Hospital Billing  100000   1020             451359   5                 18           Sect. 5.1
Sepsis            1050     846              15214    14                16           Sect. 5.2 & 5.3
Case Study        683      683              529688   776               22           Sect. 6
Table 1: Characteristics of event logs used for evaluation and case study

5.0.1 Baseline

Among several existing methods, SynonymousLabelRepair [24] appears to be more capable of handling redundant activity labels than the others. We use its default settings for all aspects of the evaluation.

5.1 Results Using Control-flow Similarity Approach

In this section, we compare our control-flow similarity measurement with the corresponding dimension of the baseline approach, to show that our approach achieves satisfactory results regardless of whether the redundant activity labels are infrequent. The goal of our approach is to use the fundamental information in logs, i.e. control-flow relations. The event log used is the Hospital Billing log, which records events related to the billing of medical services provided by a Dutch hospital. The details of the log are shown in table 1. To simulate redundant activity labels with low occurrence frequencies, the renaming percentage ranges from 0.1% to 30%. In total, 36 experimental settings are used and 180 logs are generated. We adopt a threshold of 0.25 for our approach. Figure 5 shows the f-score comparison between our approach and the baseline, where we outperform the baseline on all logs. Our approach is not sensitive to the frequency of activity labels: even when only 1% of the events of each label are affected, the f-score still exceeds 0.8, while the baseline achieves 0. Our approach is also more consistent than the baseline: the range of its f-score is only 0.26, while that of the baseline is 0.75. Figure 6 illustrates the average recall, where our approach is again consistent and outperforms the baseline.

Figure 5: Control-flow similarity average f-score
Figure 6: Control-flow similarity average recall
Figure 7: Data value similarity evaluations
Figure 8: Overall results average f-score
Figure 9: Overall results average recall

5.2 Results Based on Data Value Similarity Approach

This experiment targets data value similarity, and the log used is the Sepsis log [17], which records the treatment processes of sepsis patients from a Dutch hospital. The details of the log are shown in table 1. Three activities with data values (“CRP”, “LacticAcid” and “Leucocytes”) are picked for renaming. This experiment follows the same strategy shown in figure 4. In total, 15 settings are used and 75 logs are generated. A threshold of 0.1 is applied for our approach. The results are displayed in figure 7. Our approach outperforms the baseline on all logs in terms of f-score and recall. It achieves an f-score and recall of 1 on most event logs, while the baseline scores only 0. It is also more consistent. The reason is that most of the numerical data values are unique, because the three activities are different medical tests in the hospital. This leads to unsatisfactory results for the baseline, since it uses the PDF to compare data distributions. In contrast, our approach clusters the data values first and builds histograms before applying EMD for comparison, which reduces the adverse effect of too many distinct data values.

5.3 Overall Results

In this section, we compare the overall results obtained from our approach and the baseline. For our approach, we adopt thresholds of 0.25 and 0.1 for the control-flow and data value perspectives respectively, a subjective choice based on their good performance on the data. The decision-making mechanism is: activity labels are regarded as redundant if they are similar in any perspective. Since the baseline approach requires domain knowledge to determine the best weight setting, which we have no way to obtain, we adopt a uniform weight for each dimension. The strategy in figure 4 is followed again to generate logs from the Sepsis log. The ground truth contains not only the activity labels that we manually renamed but also any pair of three activities that are different variants of discharging a patient [17]. We conduct two experiments with the baseline approach: one uses only the fundamental information in event logs (i.e. control-flow relations, timestamps and data values), and the other uses all information (i.e. resources included), because it is not realistic to assume that all real-life logs have resource attributes. In total, 35 different settings are used, which lead to 350 logs. The results are shown in figures 8 and 9. It is clear that our approach performs much better when utilising the fundamental information. For instance, the average f-score and recall of our approach are 0.64 and 0.77 respectively, while those of the baseline are only 0.22 and 0.16. Moreover, even when the baseline incorporates more information from event logs, our approach still wins in both f-score and recall (the average f-score and recall of the baseline are then 0.44 and 0.43). Besides, when the redundant activity labels are less frequent, our approach can still detect most of the positive instances while the baseline suffers. The f-score range of our approach is only around 0.1 while that of the baseline is 0.4, which further illustrates that our approach is superior to the baseline in terms of consistency and reliability.

6 Case Study

We also conduct a real-life case study using the publicly available MIMIC-III data set (https://mimic.physionet.org/) to demonstrate that our approach can be used in real information systems. The MIMIC-III data set comprises health-related data associated with over forty thousand patients who stayed in critical care units of a US hospital between 2001 and 2012 [11].

6.1 Data Extraction

As stated in the official MIMIC-III documentation (https://mimic.physionet.org/mimictables/d_items/), redundant information exists in the form of observation activities, due to the free-text nature of data entry in the CareVue system. We follow the US observation chart in [6] (urine output, consciousness and pain are excluded) to extract observation data for male diabetes patients (ICD-9 codes [1] 250.0x-250.9x). As there are multiple activity labels representing blood pressure, we follow [31] in selecting them. Eight different observation activities are identified for blood pressure, four for systolic and four for diastolic pressure. Table 2 shows the observation activities used in the case study and their corresponding IDs in the MIMIC-III data set. We also include other activities such as admission (e.g. ED registration, ED out, etc.) and callout (e.g. create callout, update callout, etc.) following [15] to make the process more complete. The SQL code for data extraction is publicly available on GitHub (https://github.com/GilbertFan/Redundant-Activity-Detection). The details of the log are shown in table 1: the extracted log is complex (i.e. 22 activities and over half a million events) and highly variable (i.e. no two traces are the same).

ID Observation Activity
618 Respiratory Rate
50815 Flow Rate
50817 Saturation
51 Arterial BP [Systolic]
442 Manual BP [Systolic]
455 NBP [Systolic]
6701 Arterial BP #2 [Systolic]
8368 Arterial BP [Diastolic]
8440 Manual BP [Diastolic]
8441 NBP [Diastolic]
8555 Arterial BP #2 [Diastolic]
211 Heart Rate
676 Temperature (C)
Table 2: Observation activities summary with their IDs in the MIMIC-III

6.2 Result and Discussion

In this case study, the thresholds are set to 0.2 for the control-flow perspective and 0.1 for the data value perspective. Since we would like the results to be as rigorous as possible, we set the decision-making mechanism to: activity labels are redundant iff they are similar in all perspectives. We found two pairs of redundant activity labels, namely the observation activities “Arterial BP [Systolic]” and “NBP [Systolic]”, and “Arterial BP [Diastolic]” and “NBP [Diastolic]”. These two pairs correspond to each other in terms of systolic and diastolic blood pressure. They are all common blood pressure measurements in the ICU [16], which suggests they are likely candidates for redundancy. Clinically significant discrepancies exist between arterial and manual blood pressure [20], so these two labels should be treated as different (i.e. non-redundant), which is consistent with our findings. For “Arterial BP” and “Arterial BP #2”, there could be some differences in clinical interpretation, hence the pair scores low on the control-flow similarity measure. Further investigation may be needed.

7 Conclusion

This paper proposes a novel approach to accurately detect redundant activity labels in order to produce more representative logs for process mining. The method can deal with redundant labels in logs that are integrated from different data sources or generated by a system that allows free-text entry. With more representative logs, the discovered process model is more intelligible for further improvement. Compared with existing work, our method adds the following value: 1) The detection accuracy is high across logs with different frequency levels, which demonstrates that the method is reliable and consistent. The high recall indicates that the method can detect most of the positive instances. 2) The method works well for logs that only have the fundamental information, i.e. control-flow relations or numerical data values. 3) A decision-making mechanism is applied instead of weight settings from domain experts.

As demonstrated in our case study using the MIMIC-III data set, two pairs of redundant blood pressure observations are successfully detected, which further demonstrates the applicability of our approach in practice, especially in the healthcare domain. The approach can be extended to include more perspectives (e.g. resource attributes, if such information exists in event logs). Our future work will incorporate NLP (natural language processing) to automatically repair redundant activity labels by preserving the same contexts and categorizing differences into closest synonyms.


References

  • [1] ICD - ICD-9-CM - International Classification of Diseases, Ninth Revision, Clinical Modification (Nov 2015), https://www.cdc.gov/nchs/icd/icd9cm.htm
  • [2] van der Aa, H., Gal, A., Leopold, H., Reijers, H.A., Sagi, T., Shraga, R.: Instance-based process matching using event-log information. In: International Conference on Advanced Information Systems Engineering. pp. 283–297. Springer (2017)
  • [3] Assent, I., Wenning, A., Seidl, T.: Approximation techniques for indexing the earth mover’s distance in multimedia databases. In: 22nd International Conference on Data Engineering (ICDE’06). pp. 11–11. IEEE (2006)
  • [4] Bose, R.J.C., Mans, R.S., van der Aalst, W.M.: Wanna improve process mining results? In: 2013 IEEE symposium on computational intelligence and data mining (CIDM). pp. 127–134. IEEE (2013)
  • [5] Brockhoff, T., Uysal, M.S., van der Aalst, W.M.: Time-aware concept drift detection using the earth mover’s distance. In: 2020 2nd International Conference on Process Mining (ICPM). pp. 33–40. IEEE (2020)
  • [6] Cornish, L., Hill, A., Horswill, M.S., Becker, S.I., Watson, M.O.: Eye-tracking reveals how observation chart design features affect the detection of patient deterioration: An experimental study. Applied ergonomics 75, 230–242 (2019)
  • [7] Dijkman, R., Dumas, M., Van Dongen, B., Käärik, R., Mendling, J.: Similarity of business process models: Metrics and evaluation. Information Systems 36(2), 498–516 (2011)
  • [8] Fox, F., Aggarwal, V.R., Whelton, H., Johnson, O.: A data quality framework for process mining of electronic health record data. In: 2018 IEEE International Conference on Healthcare Informatics (ICHI). pp. 12–21. IEEE (2018)
  • [9] Guo, Q., Wen, L., Wang, J., Yan, Z., Philip, S.Y.: Mining invisible tasks in non-free-choice constructs. In: International Conference on Business Process Management. pp. 109–125. Springer (2016)
  • [10] Jain, A., Nandakumar, K., Ross, A.: Score normalization in multimodal biometric systems. Pattern recognition 38(12), 2270–2285 (2005)
  • [11] Johnson, A.E., Pollard, T.J., Shen, L., Li-Wei, H.L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database. Scientific data 3(1), 1–9 (2016)
  • [12] Johnson, S.C.: Hierarchical clustering schemes. Psychometrika 32(3), 241–254 (1967)
  • [13] Klinkmüller, C., Weber, I., Mendling, J., Leopold, H., Ludwig, A.: Increasing recall of process model matching by improved activity label matching. In: Business process management, pp. 211–218. Springer (2013)
  • [14] Koschmider, A., Ullrich, M., Heine, A., Oberweis, A.: Revising the vocabulary of business process element labels. In: International Conference on Advanced Information Systems Engineering. pp. 69–83. Springer (2015)
  • [15] Kurniati, A.P., Hall, G., Hogg, D., Johnson, O.: Process mining in oncology using the MIMIC-III dataset. In: Journal of Physics: Conference Series. vol. 971, p. 012008. IOP Publishing (2018)
  • [16] Li-wei, H.L., Saeed, M., Talmor, D., Mark, R., Malhotra, A.: Methods of blood pressure measurement in the ICU. Critical care medicine 41(1), 34 (2013)
  • [17] Mannhardt, F., Blinde, D.: Analyzing the trajectories of patients with sepsis using process mining. In: RADAR+ EMISA@ CAiSE. pp. 72–80 (2017)
  • [18] Mans, R.S., van der Aalst, W.M., Vanwersch, R.J., Moleman, A.J.: Process mining in healthcare: Data challenges when answering frequently posed questions. In: Process Support and Knowledge Representation in Health Care, pp. 140–153. Springer (2012)
  • [19] Mendling, J., Reijers, H.A., Recker, J.: Activity labeling in process modeling: Empirical insights and recommendations. Information Systems 35(4), 467–482 (2010)
  • [20] Mirdamadi, A., Etebari, M.: Comparison of manual versus automated blood pressure measurement in intensive care unit, coronary care unit, and emergency room. ARYA atherosclerosis 13(1), 29 (2017)
  • [21] Richter, F., Zellner, L., Azaiz, I., Winkel, D., Seidl, T.: Liproma: label-independent process matching. In: International Conference on Business Process Management. pp. 186–198. Springer (2019)
  • [22] Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. International journal of computer vision 40(2), 99–121 (2000)
  • [23] Sadeghianasl, S., ter Hofstede, A.H., Suriadi, S., Turkay, S.: Collaborative and interactive detection and repair of activity labels in process event logs. In: 2020 2nd International Conference on Process Mining (ICPM). pp. 41–48. IEEE (2020)
  • [24] Sadeghianasl, S., ter Hofstede, A.H., Wynn, M.T., Suriadi, S.: A contextual approach to detecting synonymous and polluted activity labels in process event logs. In: OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”. pp. 76–94. Springer (2019)
  • [25] Simon, H.A.: The new science of management decision. (1960)
  • [26] Sturges, H.A.: The choice of a class interval. Journal of the american statistical association 21(153), 65–66 (1926)
  • [27] Suriadi, S., Andrews, R., ter Hofstede, A.H., Wynn, M.T.: Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs. Information Systems 64, 132–150 (2017)
  • [28] Van Der Aalst, W.: Process mining: Overview and opportunities. ACM Transactions on Management Information Systems (TMIS) 3(2), 1–17 (2012)
  • [29] Van Der Aalst, W.: Data science in action. In: Process mining, pp. 3–23. Springer (2016)
  • [30] Van Der Aalst, W., Adriansyah, A., De Medeiros, A.K.A., Arcieri, F., Baier, T., Blickle, T., Bose, J.C., Van Den Brand, P., Brandtjen, R., Buijs, J., et al.: Process mining manifesto. In: International Conference on Business Process Management. pp. 169–194. Springer (2011)
  • [31] Wei, M.C., Kornelius, E., Chou, Y.H., Yang, Y.S., Huang, J.Y., Huang, C.N.: Optimal initial blood pressure in intensive care unit patients with non-traumatic intracranial hemorrhage. International journal of environmental research and public health 17(10), 3436 (2020)
  • [32] Weijters, A., Ribeiro, J.: Flexible heuristics miner (FHM). In: 2011 IEEE symposium on computational intelligence and data mining (CIDM). pp. 310–317. IEEE (2011)
  • [33] Zhang, M., Liu, Y., Luan, H., Sun, M., Izuha, T., Hao, J.: Building earth mover’s distance on bilingual word embeddings for machine translation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 30 (2016)