Evaluation of Trace Alignment Quality and its Application in Medical Process Mining

Trace alignment algorithms have been used in process mining for discovering the consensus treatment procedures and process deviations. Different alignment algorithms, however, may produce very different results. No widely-adopted method exists for evaluating the results of trace alignment. Existing reference-free evaluation methods cannot adequately and comprehensively assess the alignment quality. We analyzed and compared the existing evaluation methods, identifying their limitations, and introduced improvements in two reference-free evaluation methods. Our approach assesses the alignment result globally instead of locally, and therefore helps the algorithm to optimize overall alignment quality. We also introduced a novel metric to measure the alignment complexity, which can be used as a constraint on alignment algorithm optimization. We tested our evaluation methods on a trauma resuscitation dataset and provided the medical explanation of the activities and patterns identified as deviations using our proposed evaluation methods.


I Introduction

I-A Motivation

Process mining has proven useful in various domains, including clinical and administrative processes in health care [1]. One medical process mining application is the modeling of chest pain treatments and healthcare delivery [2]. Trace alignment is an algorithm used in process mining to discover common work patterns and deviations from established practice. Trace alignment takes as input sequences of activities performed during different process executions (i.e., “traces”) and finds, for each activity, the best match in the other traces [3]. The alignment result is a matrix in which the rows represent traces and activities of the same type are aligned in the same columns (Fig. 1). The activities most commonly executed in a similar chronological order form a “consensus sequence” [4][5][6]. The trace alignment algorithm originates from bioinformatics, where it is used to align protein and gene sequences to identify common structures and mutations. The parameters of alignment algorithms, however, are determined subjectively or based on expert assumptions rather than adjusted automatically [7]. Proper evaluation of the alignment result is therefore essential for generating an accurate alignment [8], and the evaluation results can help optimize the alignment algorithm parameters.

Alignment accuracy measures how many activities and patterns in the alignment are aligned correctly, which determines the alignment’s ability to extract useful knowledge and insights about the process [3]. Activities and patterns are correctly aligned when the aligned activities or pairs (1) are the same as in the reference or (2) satisfy certain preset constraints. Alignment accuracy can be measured from different aspects: whether the activities and patterns are accurately aligned, whether the consensus activities and potential deviations can be identified in the alignment result, and whether dissimilar activities are misaligned in the same column.

Currently, methods based on the sum-of-pairs score are the most commonly used for accuracy evaluation [9][10]. The reference-based sum-of-pairs score approach [9] requires a reference alignment to calculate the number of correctly aligned pairs in the alignment result. The reference-free sum-of-pairs score approach [10] does not need a reference alignment but assumes that only activities of the same type can be aligned. These accuracy evaluation methods do not perform well when the alignment contains patterns more complex than activity pairs [8], particularly in scenarios where the reference alignment (ground truth) is unavailable.

Fig. 1: An example of trace alignment. Activities of the same type are aligned in the same column, and each row represents an individual case formed by a trace of activities in chronological order.

| Current Evaluation Methods | Reference Required | Attributes | Limitation |
|---|---|---|---|
| Reference-based sum-of-pairs score [9] | Yes | Activities type | / |
| Reference-free sum-of-pairs score [10] | No | Activities type | / |
| Column score [11] | Yes | Activities type | Extremely sensitive to alignment errors |
| Misalignment score [12] | No | Activities type, patterns of activities | High dependency on the pattern chosen |
| Information score [12] | No | Activities type, activities frequency | Limited to a single column |

TABLE I: Summary of the current evaluation methods

Misalignment score is a reference-free method for evaluating alignment accuracy [12]. It measures alignment quality by checking whether certain patterns are aligned. Other evaluation methods for trace alignment are adopted from molecular sequence alignment, including column score and sum-of-pairs score [11]. However, biological sequences, unlike process traces, are subject to known restrictions or contain known structures, and have reference alignments established by domain experts for evaluation [13][14][15]. In process mining, especially in the early stage of discovering common activity patterns and deviations, activity restrictions or structures are unknown, and establishing a reference alignment is laborious and requires domain expertise. In addition, as new traces are collected over time, continuous updating of the reference alignment is not practical. Because current trace alignment approaches focus on acquiring knowledge without a reference alignment, reference-free alignment evaluation methods are preferred [14].

Another metric of alignment quality is alignment confidence, which distinguishes the consensus activity from other activities in the column. The confidence can be measured by the information score which quantifies the information entropy of a single column based on the activity type and the frequency of non-empty elements of the column [12].

Current evaluation methods suffer from various limitations (TABLE I): the column score is sensitive to alignment errors; the misalignment score is determined by arbitrarily chosen patterns and does not reflect the overall alignment quality; and the information score only evaluates individual columns. To address these limitations, we modified the misalignment score and information score. In addition, we introduced a novel metric of alignment complexity, which measures the redundancy in the alignment based on the alignment’s length. We then integrated the existing metrics and our new metrics into an evaluation of trace alignment that covers alignment accuracy, confidence, and complexity and can reflect the overall quality of the alignment result without a reference alignment.

We validated our evaluation methods on the alignments of medical process data. Our method outperformed previous evaluation methods in terms of the correlation between the evaluation results and the number of alignment errors. In addition, we provided medical explanations of the process deviations identified by our new evaluation methods.

I-B Related Work

Two widely used alignment accuracy evaluation methods are sum-of-pairs score (SPS) and column score (CS) [15][10][9]. There are two versions of the sum-of-pairs score: reference-based [9] and reference-free [10]. Reference-based sum-of-pairs score compares the aligned activities with a reference alignment to calculate the number of correctly aligned activity pairs. Reference-free sum-of-pairs score considers two activities as correctly aligned if they are of the same type (match). Column score compares the columns in the alignment result to the reference alignment to check if the columns are correctly aligned.

Another method for evaluating alignment accuracy is misalignment score, which measures the distance between the incorrectly aligned instances of a pattern within the traces. The distance is defined as the number of columns between two activities that are supposed to be aligned in the same column. The details of the misalignment score will be discussed later. The misalignment score depends on the pattern chosen and is considered as a pattern-wise alignment accuracy [12].

Alignment confidence is another evaluation metric, based on the information score [12]. The current column-wise information score is useful for evaluating the confidence of individual columns, but it does not quantify the overall information entropy of the whole alignment.

Previous studies do not consider the alignment complexity as an individual metric for evaluation [8][16]. Trace alignment does not change the type, number or order of the original activities—it only inserts an empty space in traces for which it cannot find a matching activity in a given column. The alignment complexity measures the number of excessive empty spaces that the alignment contains. The optimal alignment has a minimum needed number of empty spaces, while other alignments may have more empty spaces. Although the optimal alignment complexity does not guarantee optimal alignment result, it is better to avoid unnecessary empty spaces, which may cause problems for the column-wise information score, as discussed later. Therefore, it is necessary to consider and quantify alignment complexity as a separate metric.

The reference alignment may often be unavailable because acquiring it is labor intensive, requires domain knowledge, and is subject to human bias. Calculating the optimal trace alignment result may not be practical when the number of traces is large due to the computational complexity [12]. However, having a reference alignment is necessary to validate an evaluation method and quantify the number of errors in the alignment result. We use an optimum-approaching alignment method called M-COFFEE [17] to generate the consensus multi-trace alignment as our reference alignment.

I-C Contribution

The contributions of this paper are:

  • Enhancement of misalignment score and information score metrics, and a novel metric of alignment complexity: We introduced a novel metric of alignment complexity to quantify redundancies in alignment and modified the existing misalignment score and information score metrics to evaluate overall alignment accuracy and confidence. We showed that our methods outperform the previous methods on both synthetic and real-world process logs. We analyzed the influence of evaluation metrics on the overall quality of alignment and proposed a general evaluation procedure that can help identify accurate alignment and optimize the alignment algorithm.

  • Application of trace alignment evaluation to understand medical activities in context: We validated our proposed evaluation methods on data from a real-world medical process and extracted the consensus sequence of activities. We obtained an accurate alignment based on our evaluation methods, analyzed process deviations in activities, and provided medical explanations within a case study of a medical process.

II Methodology

We first describe the trace alignment algorithm, followed by our three alignment metrics: accuracy, confidence, and complexity. We then present our procedure for alignment evaluation.

II-A Alignment Algorithm

Trace alignment algorithms include Needleman–Wunsch [18], Smith–Waterman [19] and Duration-Aware Trace Alignment [6]. The Needleman–Wunsch and Duration-Aware Trace Alignment algorithms aim to find a globally optimal alignment, while Smith–Waterman aims to find a locally optimal alignment. Globally optimal alignments align entire activity sequences, from start to end, while a locally optimal alignment finds the optimal alignment for subprocesses of the process. Duration-Aware Trace Alignment considers the duration of activities, in addition to their sequential order, to generate a globally optimal alignment. We focused on globally optimal alignments for the following reasons:

  • The need to model the whole process execution instead of a part of it: In real-world process logs, finding common patterns between entire traces of process execution is important to extract the workflow. Locally optimal alignments only align segments of the activity sequence, making them not suitable for workflow extraction.

  • Alignment shrinkage due to noise: Because the locally optimal algorithm aligns similar subprocesses within process traces, these similar subprocesses are likely to become shorter as the number of traces increases, resulting in alignment shrinkage. In real-world processes, great flexibility and variability of process performance make the common segments very short or nonexistent. Because of the alignment shrinkage problem, consensus activities across different process executions cannot be identified.
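As an illustration of a globally optimal alignment, the following is a minimal sketch of pairwise Needleman–Wunsch alignment for two traces, with each activity encoded as one character. The scoring values (match = 1, mismatch = -1, gap = -1) and the function name `needleman_wunsch` are assumptions made for this example, not settings taken from the experiments in this paper.

```python
GAP = "-"


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align traces a and b; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty trace costs one gap per activity.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the dynamic-programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


aligned_a, aligned_b, total = needleman_wunsch("ABDC", "ABC")
print(aligned_a)
print(aligned_b)
print(total)
```

Both traces are aligned from start to end, so the extra activity in the longer trace is matched with a gap rather than being dropped, which is the behavior needed for workflow extraction.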

An important issue with globally optimal alignment is its high time complexity, which depends on the number of traces and the average trace length and makes the computation impractical for a large number of traces [12]. This problem is usually addressed by approximating the globally optimal alignment using a heuristic approach or progressive alignment construction [20]. The heuristic approach builds a guide tree connecting the traces, and the algorithm iteratively aligns the two closest traces or alignment profiles (intermediate alignment results) in the guide tree until all traces have been aligned.

An approximation of the globally optimal alignment using a heuristic approach may introduce deviations from the globally optimal alignment, which are called heuristic errors [4]. Heuristic errors appear randomly, depending on the guide tree built by the heuristic approach, generating different alignment results with different heuristics. Since the heuristic approach generates no other types of alignment error except for the heuristic errors, we used the number of heuristic errors as an indicator for measuring the alignment accuracy.

II-B Improved Metric: Accuracy

The alignment accuracy metric measures the number of activities correctly aligned and is directly correlated to the number of heuristic errors. Alignment accuracy evaluation methods include sum-of-pairs score (SPS), column score (CS), and misalignment score.

II-B1 Sum-of-pairs Score (SPS)

Reference-based sum-of-pairs score

The reference-based sum-of-pairs score, sometimes called quality score (Q score) [21], is the total number of correctly aligned activity pairs in the alignment result divided by the number of all aligned activity pairs in the reference alignment. Correct alignment means that the activity pairs aligned in the alignment result are also aligned in the reference alignment. The reference-based sum-of-pairs score has two derivatives: developer score and modeler score [21]. The derivatives only differ in terms of choosing the reference alignment and are consistent with each other in the same process log.
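The definition above can be sketched as follows. This is a minimal illustration under stated assumptions: an alignment is a list of equal-length strings with one character per activity and "-" as the gap symbol, and each activity is identified by its trace index and its position in the ungapped trace. The function names `aligned_pairs` and `reference_based_sps` are hypothetical.

```python
from itertools import combinations

GAP = "-"


def aligned_pairs(alignment):
    """Return the set of activity pairs that share a column.

    Each activity is identified by (trace index, position in the ungapped trace).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, symbol in enumerate(column):
            if symbol != GAP:
                present.append((row, counters[row]))
                counters[row] += 1
        # Rows are visited in order, so each pair has a canonical orientation.
        pairs.update(combinations(present, 2))
    return pairs


def reference_based_sps(alignment, reference):
    """Pairs aligned in both alignments, divided by pairs aligned in the reference."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(aligned_pairs(alignment) & ref_pairs) / len(ref_pairs)


reference = ["AB-C", "ABDC", "A-DC"]
candidate = ["ABC-", "ABDC", "A-DC"]  # activity C of the first trace is shifted
print(reference_based_sps(candidate, reference))
```

In this toy example, shifting a single activity removes two of the eight reference pairs, so the score degrades gradually rather than all at once.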

Reference-free sum-of-pairs score

Reference-free sum-of-pairs score assumes that aligning activities of the same type is correct (“match”), aligning activities of different types is incorrect (“mismatch”), and aligning an activity to an empty space is acceptable (“gap”) [10]. Mismatches are given a penalty, while gap penalties vary across different scoring schemes. The reference-free sum-of-pairs score measures the similarity between traces within the alignment. A higher similarity indicates a better alignment quality. We adopted a commonly used scoring scheme: match = 1, mismatch = -1 and gap = 0 [10].
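A minimal sketch of this scoring scheme is given below, assuming an alignment is represented as a list of equal-length strings with "-" as the gap symbol. The function name `reference_free_sps` is hypothetical, and scoring a gap-gap pair as a gap is an assumption of this sketch.

```python
from itertools import combinations

GAP = "-"


def reference_free_sps(alignment, match=1, mismatch=-1, gap=0):
    """Sum the pair scores over every column and every pair of traces."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned traces must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == GAP or b == GAP:
                total += gap       # activity (or gap) aligned to an empty space
            elif a == b:
                total += match     # same activity type
            else:
                total += mismatch  # different activity types
    return total


alignment = ["AB-C", "ABDC", "A-DC"]
print(reference_free_sps(alignment))
```

Because the score is an unnormalized sum, its upper bound depends on the number of traces and activities in the data set.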

II-B2 Column Score (CS)

The column score is defined as the number of correctly aligned columns divided by the total number of columns in the alignment [11]. Here, “correctly aligned” means that the types and numbers of activities in an alignment column are exactly the same as those in the corresponding reference column.

Fig. 2: (a) Reference alignment (b) Alignment with 1 misalignment (c) Alignment with 2 misalignments (misaligned activities are shaded in gray)

Column score is highly sensitive to misaligned activities. Any activity in an alignment column that differs from the reference makes the whole column count as misaligned. The score does not distinguish the number of misaligned activities within the same column (Fig. 2), making it a coarse measure of alignment accuracy.
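The sensitivity described above can be seen in a small sketch. It assumes an alignment is a list of equal-length strings with "-" as the gap symbol, treats a column as correct only if an identical column (same activities from the same trace positions, same gaps) exists in the reference, and divides by the number of columns in the evaluated alignment. These matching and normalization choices, and the function names, are assumptions of this example.

```python
GAP = "-"


def column_signatures(alignment):
    """Describe each column by which activity of each trace it contains."""
    counters = [0] * len(alignment)
    signatures = []
    for column in zip(*alignment):
        signature = []
        for row, symbol in enumerate(column):
            if symbol == GAP:
                signature.append(None)
            else:
                # (activity type, position in the ungapped trace)
                signature.append((symbol, counters[row]))
                counters[row] += 1
        signatures.append(tuple(signature))
    return signatures


def column_score(alignment, reference):
    """Fraction of columns in the alignment that also occur in the reference."""
    ref_columns = set(column_signatures(reference))
    columns = column_signatures(alignment)
    correct = sum(1 for column in columns if column in ref_columns)
    return correct / len(columns)


reference = ["AB-C", "ABDC", "A-DC"]
candidate = ["ABC-", "ABDC", "A-DC"]  # one activity shifted by one column
print(column_score(candidate, reference))
```

A single shifted activity invalidates both the column it left and the column it entered, so half of the columns in this toy example are counted as wrong.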

II-B3 Misalignment Score

Alignment accuracy can also be quantified by the misalignment score. In a trace, consecutive activities with a specific order form a certain pattern, and the misalignment score considers the sequential order of these activities in the pattern. For a specific pattern, misalignment score measures the similarity between traces by checking if the patterns are aligned in the same columns. If not, misalignment score measures the distance between each column the patterns are in and sums the distances. Misalignment scores of traces form a matrix containing pairwise misalignment scores [12]:


Here, the pairwise score is computed from the larger number of pattern repetitions between the two traces, the mapping set of pattern instances in the two traces, and the score of each pattern instance in a trace, which depends on whether the instance is aligned to a gap or to another activity that is not in the pattern [12].

Note that the pattern considered in the misalignment score does not include the gap symbol “-”. The higher misalignment score indicates less similarity between traces, meaning that the traces are considered misaligned. The misalignment score adds up all pairwise misalignment scores:


Because the misalignment score measures the degree of misalignment with respect to a certain pattern in the alignment, the crucial parameter is the pattern. However, the choice of the pattern has not been established. The problem exists in situations where a single pattern may not reflect the misalignments in the whole alignment. In addition, the existing misalignment score metric does not consider the pattern’s frequency, which also influences the quality. To achieve a more comprehensive view of misalignments, we modified the method for selecting the pattern and defined an overall misalignment score based on the patterns in the alignment.

II-B4 A Novel Overall Misalignment Score (OMS)

Frequency is one of the most intuitive ways to measure representativeness. As mentioned, the misalignment score depends on the pattern chosen, which is supposed to be representative of the whole alignment. Patterns vary in length and contribute differently to the alignment accuracy: longer patterns tend to be much rarer than shorter patterns, and they are more likely to be misaligned because they require aligning more activities. The pattern’s length and frequency have a characteristic distribution (Fig. 3): longer patterns (with a length greater than 5) are much less frequent than shorter patterns. Hence, when measuring the misalignments of patterns in the whole alignment, distinguishing the patterns by their frequencies depicts the distribution of misalignments more precisely than using only the most frequent pattern.

Fig. 3: Pattern count and length distribution over frequency for a synthetic log. The synthetic logs are generated using PLG [22] based on 5 simplified medical process models established by medical experts. The X-axis shows the percentage interval of each pattern’s frequency relative to the upper bound, and the Y-axis shows the count of patterns. Patterns of different lengths are drawn as different bars, with the corresponding length given in the legend.

We chose the patterns that occur more frequently than a threshold for calculating the misalignment score. Each pattern’s misalignment score is assigned a weight based on the ratio of the pattern’s frequency to the frequency of the most frequent pattern. With this weighting, a pattern of higher frequency has a larger influence on the misalignment score. Our overall misalignment score (OMS) is:


Here, the score combines, for each selected pattern, its number of occurrences and its original misalignment score, together with the maximum number of occurrences among the patterns and the number of patterns. This overall misalignment score considers all eligible patterns, and the alignment accuracy mainly depends on shorter but frequent patterns, instead of longer but infrequent ones.

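| Log | Maximum pattern frequency | Frequency threshold | Threshold (% of maximum) | Correlation of OMS to the number of heuristic errors |
|---|---|---|---|---|
| 1 | 10 | 2 | 20.00 | 0.7048 |
| | | 4 | 40.00 | **0.9834** |
| | | 6 | 60.00 | 0.9650 |
| | | 8 | 80.00 | 0.8934 |
| | | 10 | 100.00 | 0.8695 |
| 2 | 27 | 5 | 18.52 | 0.8380 |
| | | 11 | 40.74 | **0.9539** |
| | | 16 | 59.26 | 0.9023 |
| | | 22 | 81.48 | 0.7375 |
| | | 27 | 100.00 | 0.6193 |
| 3 | 32 | 7 | 21.88 | 0.6371 |
| | | 14 | 43.75 | **0.9778** |
| | | 20 | 62.50 | 0.8762 |
| | | 26 | 81.25 | 0.8506 |
| | | 32 | 100.00 | 0.7790 |
| 4 | 57 | 14 | 24.56 | 0.7159 |
| | | 26 | 45.61 | 0.7632 |
| | | 36 | 63.16 | **0.7711** |
| | | 47 | 82.46 | 0.7172 |
| | | 57 | 100.00 | 0.6850 |
| 5 | 164 | 33 | 20.12 | 0.7946 |
| | | 65 | 39.63 | **0.8768** |
| | | 98 | 59.76 | 0.8319 |
| | | 131 | 79.88 | 0.8420 |
| | | 164 | 100.00 | 0.8152 |

The setting with the highest correlation for each log is in bold.

TABLE II: Overall misalignment score’s correlation to the number of heuristic errors with different frequency threshold settings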

| Log | OMS | MS* |
|---|---|---|
| 1 | 0.9834 | 0.9702 |
| 2 | 0.9539 | 0.9025 |
| 3 | 0.9778 | 0.9457 |
| 4 | 0.7632 | 0.6903 |
| 5 | 0.8768 | 0.7308 |

* Based on the most frequent pattern in the alignment

TABLE III: Correlation between misalignment scores and the number of heuristic errors in different logs

Since the frequency threshold affects the evaluation method’s ability to evaluate misalignments, we would like to maximize the correlation between the overall misalignment score and the number of heuristic errors by choosing an appropriate threshold. We determined this correlation with different threshold settings on different synthetic logs (TABLE II) and found that a threshold of about 40% of the maximum pattern frequency yields the highest correlation in every log except log 4, where the correlation at this setting (0.7632) still comes close to the highest correlation of 0.7711. Thus, we recommend using 40% of the maximum pattern frequency as the frequency threshold for choosing patterns in the overall misalignment score calculation.
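The pattern selection and weighting can be sketched as follows. This is only an illustration under stated assumptions: per-pattern frequencies and misalignment scores are taken as given inputs, the weights are the frequency ratios described above, and the final normalization by the sum of the weights is an assumption of this sketch rather than the definition used in this paper. The function name `weighted_misalignment` and the example values are hypothetical.

```python
def weighted_misalignment(patterns, threshold_ratio=0.4):
    """Combine per-pattern misalignment scores using frequency weights.

    patterns maps a pattern to (frequency, misalignment score).
    Returns (combined score, sorted list of selected patterns).
    """
    if not patterns:
        raise ValueError("no patterns given")
    max_frequency = max(freq for freq, _ in patterns.values())
    threshold = threshold_ratio * max_frequency
    # Keep only the patterns that occur more frequently than the threshold.
    selected = {p: v for p, v in patterns.items() if v[0] > threshold}
    # Weight each pattern by its frequency relative to the most frequent one.
    weights = {p: freq / max_frequency for p, (freq, _) in selected.items()}
    weighted_sum = sum(weights[p] * score for p, (_, score) in selected.items())
    # Assumption: normalize by the total weight to obtain a weighted mean.
    return weighted_sum / sum(weights.values()), sorted(selected)


# Hypothetical frequencies and misalignment scores for three patterns.
example = {"AB": (10, 2.0), "BC": (5, 4.0), "ABC": (3, 9.0)}
score, used = weighted_misalignment(example)
print(used)
print(score)
```

With the 40% threshold, the rare pattern "ABC" is excluded even though it has the largest individual misalignment score, so the result is driven by the frequent, shorter patterns.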

To compare our metric (3) to the original misalignment score (2), we analyzed the correlation of the overall misalignment score (OMS) and the original misalignment score (MS) to the number of heuristic errors on the same synthetic data sets (TABLE III). We generated 30 alignments with different numbers of heuristic errors for each log. The results showed that OMS has a higher correlation to the number of heuristic errors, indicating that OMS performed better in evaluating the overall alignment accuracy than the original misalignment score.

The time complexity of calculating the overall misalignment score is , where extracting patterns takes time, and calculating misalignment score for each pattern takes time; is the number of traces and is the longest trace length.

II-C Improved Metric: Confidence

Fig. 4: (a) Reference alignment (b) Preferred alignment by information score

II-C1 Information Score

As mentioned, the confidence of a single column in the alignment result is measured by the information score. The information score is a quantification method based on information entropy that considers activity types and frequencies. Each type of activity has a frequency in the column, and the column’s information entropy is calculated from these frequencies [12]:

$$E = -\sum_{a \in \mathbb{A} \cup \{-\}} p_a \log p_a \qquad (4)$$

$$IS = 1 - \frac{E}{E_{max}} \qquad (5)$$

where $IS$ is the column’s information score, $E$ is the column’s information entropy, and $p_a$ is the occurrence frequency of activity type $a$ in the column; $E_{max}$ is the maximum entropy of the whole alignment, which equals $\log(|\mathbb{A}| + 1)$, where $|\mathbb{A}|$ is the number of activity types ($+1$ for the gap activity).

The purpose of trace alignment is to discover consensus activities and to detect deviations, and a high alignment confidence helps in achieving this purpose. According to (4) and (5), if all types of activities (including the gap) occur in the column with the same frequency, the information score reaches the minimum value of 0, meaning low confidence in the column; if the column is filled with only one type of activity, the information score reaches the maximum value of 1, indicating high confidence.
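The two extreme cases described above can be checked with a short sketch. It assumes the information score is one minus the column entropy divided by the maximum entropy, with the maximum entropy taken as the logarithm of the number of activity types plus one for the gap; the function name `information_score` and the base-2 logarithm are choices made for this example (the ratio does not depend on the base).

```python
import math
from collections import Counter


def information_score(column, n_activity_types):
    """Column-wise information score: 1 - entropy / maximum entropy."""
    counts = Counter(column)          # the gap symbol "-" counts as its own type
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(n_activity_types + 1)  # +1 for the gap
    return 1 - entropy / max_entropy


# Three activity types (A, B, C) plus the gap.
print(information_score("AAAA", 3))  # one activity type fills the column
print(information_score("ABC-", 3))  # all types, including the gap, equally frequent
```

A column that lies between these extremes, such as one half filled with a single activity and half with gaps, receives an intermediate score.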

Problems arise when considering information score for columns only: To obtain higher information score, the alignment tends to (1) split columns if there are two or more activities in the column and the gaps cannot be replaced by activities in other columns; or (2) align activities incorrectly in one column if the gaps in this column can be replaced by shifting other activities.

In situation (1), the alignment tends to split activities that are already aligned. In this case, each column’s information score will be higher, but the alignment becomes unreasonably long. If the target alignment (a) has 50% of activity B in columns 2 and 3, the alignment preferred by the information score (b) will place each B in an individual column, in columns 2, 3, 4 and 5 (Fig. 4). In situation (2), the alignment algorithm tends to shorten the alignment by aligning incorrect activities in a column (Fig. 4). Thus, if the information score is applied to evaluate an alignment algorithm column by column, the algorithm tends to split or merge columns incorrectly.
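Situation (1) can be reproduced numerically with a small sketch. It assumes the column-wise information score is one minus the column entropy divided by the maximum entropy, and it uses a hypothetical column of four traces with two activity types (A and B); the function name and the example columns are not taken from Fig. 4.

```python
import math
from collections import Counter


def information_score(column, n_activity_types):
    """Column-wise information score: 1 - entropy / maximum entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 1 - entropy / math.log2(n_activity_types + 1)


# One column in which half of the traces contain activity B.
merged = "BB--"
# The same two B activities split into two separate columns.
split = ["B---", "-B--"]

merged_score = information_score(merged, 2)
split_scores = [information_score(column, 2) for column in split]

print(merged_score)
print(split_scores)
```

Each split column scores higher than the merged column, although splitting separates activities that were correctly aligned and makes the alignment one column longer.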

II-C2 Proposed Overall Information Score (OIS)

To address the problems with column-wise information score, we added a constraint on the cumulative information entropy for all columns:


Here, the index runs over the columns of the alignment, up to the alignment length. We then propose our overall information score (OIS) as:


With this constraint, the overall information score does not increase monotonically with the number of columns. If the gain in the information entropy of the split column does not exceed the loss of average information entropy of the whole alignment, the column-splitting operation is considered unnecessary and is not performed.

The overall information score reduces the unnecessary column-splitting problem of evaluating the alignment confidence using column-wise information score and reflects the overall confidence in the whole alignment.

Fig. 5: (a) Absolute lower bound (0.25) and (b) absolute upper bound (0.67, with 18 gaps “-”) of alignment complexity for three example traces

| Methods | Metric | Attributes Considered | Score Range | Monotony |
|---|---|---|---|---|
| Reference-free sum-of-pairs score* | Accuracy | Activities type | | when |
| Reference-based sum-of-pairs score | Accuracy | Activities type | | when |
| Column score | Accuracy | Activities type | | when |
| Modified Misalignment score | Accuracy | Patterns type, frequency | | when |
| Modified Information score | Confidence | Activities type, frequency | | when information amount |
| Alignment complexity | Complexity | Activities frequency, alignment length | | when alignment length |

* The upper bound for the reference-free sum-of-pairs score depends on the data set and does not have a fixed value

TABLE IV: Attributes considered for evaluating alignment and their monotony

II-D Novel Metric: Complexity

In current evaluation methods, the length is not considered as an individual metric [14]; however, based on the previous discussion, we found that alignment length has a significant influence on alignment evaluation methods, including column-wise information score and column score. Alignment length also reflects the computational complexity needed to perform alignment: longer alignments contain more places for filling activities, resulting in higher computational complexity. In this section, we propose our new metric of alignment complexity based on the alignment length and the number of activities.

The alignment complexity can be calculated by the percentage of gaps: some gaps in the alignment are unnecessary, and so the percentage of gaps reflects the degree of redundancy. Yet not all the gaps are unnecessary since aligning activities requires gaps to place activities that cannot be aligned.

Note that there is a minimum number of gaps required to accomplish an alignment. Since traces in the alignment are of the same length, the differences in the original traces’ lengths will be filled by gaps. The number of gaps required to fill the traces to the maximum length gives the lower bound of alignment complexity. For example, for the three traces shown in Fig. 5, the minimum number of gaps needed is 3. In this example, no extra gaps are added and all columns are correctly aligned (Fig. 5 (a)).

There is also an upper bound of alignment complexity, which occurs when every column in the alignment contains only one activity from the original trace log (Fig. 5 (b)). Thus, the alignment complexity of an alignment can be written as:

$$AC = \frac{n \cdot l - N}{n \cdot l}, \qquad l \ge l_{min} \qquad (8)$$

where $N$ is the number of activities in the original trace log, $n$ is the number of traces, $l$ is the alignment length, and $l_{min}$ is the shortest possible alignment length, which is also the length of the longest original trace.

A lower alignment complexity means that the alignment has less redundancy. However, the optimal alignment is not guaranteed to have the lowest alignment complexity. Because of this limitation, alignment complexity should be given the lowest priority when combined with other methods.
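The bounds discussed above can be reproduced with a short sketch that computes alignment complexity as the percentage of gaps. The three traces used here are hypothetical (lengths 4, 3 and 2, chosen so that the bounds match the values reported in Fig. 5), and the function names are assumptions of this example.

```python
GAP = "-"


def alignment_complexity(alignment):
    """Fraction of cells in the alignment matrix that are gaps."""
    n_traces = len(alignment)
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned traces must have the same length")
    n_activities = sum(1 for row in alignment for s in row if s != GAP)
    return (n_traces * length - n_activities) / (n_traces * length)


def complexity_bounds(traces):
    """Lower and upper bound of the gap fraction for a set of original traces."""
    n_traces = len(traces)
    n_activities = sum(len(t) for t in traces)
    shortest = max(len(t) for t in traces)  # shortest possible alignment length
    longest = n_activities                  # one activity per column
    lower = (n_traces * shortest - n_activities) / (n_traces * shortest)
    upper = (n_traces * longest - n_activities) / (n_traces * longest)
    return lower, upper


traces = ["ABCD", "ABC", "AB"]  # hypothetical traces with 9 activities in total
print(complexity_bounds(traces))
print(alignment_complexity(["ABCD", "ABC-", "AB--"]))
```

For these traces the shortest alignment needs 3 gaps in 12 cells, and the longest alignment has 18 gaps in 27 cells, which reproduces the bounds of 0.25 and about 0.67.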

II-E General Evaluation Procedure

The attributes considered in each evaluation method and their monotony are summarized in TABLE IV. When evaluating an alignment with a reference alignment, a higher reference-based sum-of-pairs score means the alignment is closer to the reference alignment, and a lower misalignment score indicates that the alignment has a lower degree of misalignment in its patterns; a higher information score means more confidence in the common patterns and deviations found in the alignment, and a lower alignment complexity is expected for lower redundancy. The reference-free evaluation does not use the reference-based sum-of-pairs score or column score, while the other evaluation methods are the same as in the reference-based evaluation.

| Data set | Number of heuristic errors | Ref-free SPS | Ref-based SPS | MS* | OMS | OIS | Column Score | Alignment Complexity |
|---|---|---|---|---|---|---|---|---|
| Primary Survey | 61 | 1478 | 0.4833 | 5.3097 | 5.0424 | 0.5793 | 0.3902 | 0.8203 |
| | 52 | 1422 | 0.4493 | 5.2909 | 5.0242 | 0.5837 | 0.4186 | 0.8287 |
| | 43 | 1669 | 0.5769 | 5.2776 | 4.9818 | 0.5902 | 0.4103 | 0.8112 |
| | 42 | 1615 | 0.5459 | 5.2685 | 4.9758 | 0.5949 | 0.4146 | 0.8204 |
| | 36 | 1684 | 0.5803 | 5.2358 | 4.9455 | 0.5794 | 0.4324 | 0.8010 |
| | 25 | 1810 | 0.6443 | 4.7818 | 4.4182 | 0.5676 | 0.4688 | 0.7699 |
| | 20 | 1932 | 0.7011 | 4.7076 | 4.3576 | 0.5789 | 0.5161 | 0.7625 |
| | 19 | 1814 | 0.6298 | 4.0961 | 3.9576 | 0.5729 | 0.3939 | 0.7769 |
| | 13 | 1689 | 0.5798 | 3.7697 | 3.6727 | 0.5663 | 0.2857 | 0.7896 |
| | 5 | 1782 | 0.5958 | 3.7494 | 3.6000 | 0.5735 | 0.3235 | 0.7834 |
| | 0 (Ref) | 2297 | 1.0000 | 4.5924 | 4.3091 | 0.6545 | 1.0000 | 0.7834 |
| Correlation to the number of heuristic errors | / | -0.8188 | -0.7346 | 0.8187 | **0.8547** | -0.2594 | -0.3805 | 0.7875 |

The best results are marked in bold, and the alignment with 0 heuristic errors is used as the reference alignment.

* The pattern chosen for the original misalignment score is the most frequent pattern in the alignment regardless of the pattern’s length

TABLE V: Correlation of the evaluation methods with the number of heuristic errors in alignments

Currently, there is no general procedure for trace alignment evaluation [23]. To evaluate the overall alignment quality, evaluation metrics should be standardized. Alignment accuracy, confidence, and our proposed complexity metrics are used to evaluate the overall quality of the alignment results. Though measured independently, they are not strictly orthogonal or independent of each other because:

  • The attributes taken into consideration are neither mutually exclusive nor fully inclusive. Accuracy, confidence, and complexity involve different attributes, including activity types, frequency, pattern types and alignment length. These attributes are not orthogonal in the feature space; they have a certain degree of overlap. For example, a pattern’s frequency is correlated to the frequency of each activity in the pattern, which makes the overall misalignment score partially related to the overall information score.

  • Though some correlation exists between the evaluation metrics, the correlation cannot be quantified generally since it is data-specific. The correlation between overall misalignment score and information score depends on the frequent activities and patterns in the data; the activities and patterns differ between data sets and may result in correlation coefficients ranging from -1 to 1.

Although they may overlap, each evaluation metric describes a unique aspect of the alignment, and should be considered in this order:

First, alignment accuracy is the most important metric, as it is directly related to the number of alignment errors. Since the column score only provides a coarse evaluation and is sensitive to deviations, it is considered less significant when other accuracy methods are available.

Second, alignment confidence is measured by the overall information score. A high overall information score is expected for an alignment with strong confidence in the extracted patterns and deviations. The overall information score will not increase if the alignment splits aligned columns.

Finally, alignment complexity has the lowest priority because it does not contribute to the consensus activities found or to the amount of information in the alignment. However, alignment complexity can be used to avoid unnecessary computational complexity and to reduce redundancies in the alignment.
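The priority order described above can be sketched as a lexicographic comparison. This is only an illustrative sketch, not the procedure used in the experiments: the class `AlignmentScores` and the functions `priority_key` and `rank_alignments` are hypothetical names, the rounding to four decimals is an assumption for handling ties, and the assignment of TABLE V columns to metrics is inferred from the discussion of the results rather than from the table header.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentScores:
    label: str
    overall_misalignment: float  # accuracy; assumed: lower is better
    overall_information: float   # confidence; assumed: higher is better
    complexity: float            # assumed: lower is better


def priority_key(s: AlignmentScores, digits: int = 4):
    """Sort key: accuracy first, then confidence, then complexity."""
    return (
        round(s.overall_misalignment, digits),   # smaller sorts first
        -round(s.overall_information, digits),   # negate so larger sorts first
        round(s.complexity, digits),             # smaller sorts first
    )


def rank_alignments(candidates):
    """Return candidates ordered from most to least preferred."""
    return sorted(candidates, key=priority_key)


# Three rows taken from TABLE V (column mapping is an assumption).
candidates = [
    AlignmentScores("20 errors", 4.3576, 0.5789, 0.7625),
    AlignmentScores("5 errors", 3.6000, 0.5735, 0.7834),
    AlignmentScores("reference", 4.3091, 0.6545, 0.7834),
]
ranked = rank_alignments(candidates)
print([c.label for c in ranked])
```

Under this ordering, confidence and complexity only act as tie-breakers, so the alignment with 5 heuristic errors is ranked first on its overall misalignment score alone.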

III Experiment and analysis

III-A Experiment Design

The Needleman–Wunsch and Duration-Aware Trace Alignment algorithms generate different alignments. A reference alignment with zero heuristic errors was set based on the consensus multi-trace alignment produced by M-COFFEE. For each alignment, we counted the number of heuristic errors. Aligning the same log with different parameters generated alignments with different numbers of heuristic errors. Using these alignments, we computed the correlation between the evaluation results and the number of heuristic errors. A high absolute value of the correlation indicates that the evaluation method truly reflects the alignment accuracy. Although alignment confidence and complexity are not designed to measure alignment accuracy, their correlation was also computed to check whether they overlap with the alignment accuracy evaluation methods.
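The correlation step can be sketched as follows. The text does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, and the function name `pearson` is hypothetical. The example data are only a subset of the rows of TABLE V (number of heuristic errors and the column taken to be the overall misalignment score), so the value it produces is not expected to match the correlation reported in the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered cross-products and squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Subset of TABLE V rows (assumed column mapping): heuristic errors vs.
# overall misalignment score.
heuristic_errors = [36, 25, 20, 19, 13, 5, 0]
overall_misalignment = [4.9455, 4.4182, 4.3576, 3.9576, 3.6727, 3.6000, 4.3091]

r = pearson(heuristic_errors, overall_misalignment)
print(f"correlation on the subset: {r:.4f}")
```

A coefficient close to 1 or -1 would indicate that the metric tracks the number of heuristic errors closely, which is how the correlation row of the table is read.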

The log in this experiment includes 33 individual trauma resuscitation cases with 247 activities of 14 types. The data set was collected from August to December 2014. The log was coded by medical experts reviewing videos of trauma resuscitation. The use of this data set and its related research has been approved by the IRB of the Children’s National Medical Center.

III-B Results

III-B1 Alignment Evaluation

The evaluation results show the correlation between the different evaluation methods and the number of heuristic errors (TABLE V). The overall misalignment score has the highest correlation with the number of heuristic errors, while the reference-free sum-of-pairs score, reference-based sum-of-pairs score, misalignment score and alignment complexity also achieve relatively high correlations. This shows that our overall misalignment score reflects the alignment accuracy better than the original misalignment score and the sum-of-pairs score. Our proposed alignment complexity also has a high correlation with the number of heuristic errors here because correcting a heuristic error often results in a shorter alignment and thus decreases the alignment complexity. However, this is not guaranteed. If the heuristic error is not the only activity in its column, correcting the error may not decrease the alignment length.

The reference alignment performs best in reference-free sum-of-pairs score, reference-based sum-of-pairs score, overall information score, and column score, while the alignment with 5 heuristic errors performs best in misalignment score and overall misalignment score, and the alignment with 20 heuristic errors performs best in alignment complexity. The alignment with 5 heuristic errors aligns frequent patterns correctly but misaligns some patterns with lower frequency; the reference alignment aligns most long patterns correctly, except for some patterns of length 2. Since the overall misalignment score considers pattern frequency, and shorter patterns are probably more frequent, the reference alignment here does not perform best in the misalignment score. The alignment with 20 heuristic errors has a short alignment length, but its low complexity comes at the price of spreading activities evenly among columns instead of aligning each activity with the largest group of other activities. This confirms our assumption that alignment complexity should not be a prioritized metric. However, lower alignment complexity is still preferable when the other metrics are equal, since it reduces the computational complexity and the redundancy in the alignment.

Fig. 6: (a) Alignment with 5 heuristic errors, labeled A, B, C, D and E. (b) Reference alignment in TABLE V, with deviations labeled from 1 to 12. See the digital version for color and higher resolution.

The evaluation methods indicate that the alignment with 5 heuristic errors has the best performance in the overall misalignment score and relatively good performance in the sum-of-pairs score and alignment complexity. The alignment with 5 heuristic errors (Fig. 6(a)) aligns activities to the majority activities, while activity is aligned in a single column as a deviation. The alignment with 5 heuristic errors has a lower overall misalignment score than the reference alignment (Fig. 6(b)), which aligns activities to the deviation activities in cases 1 and 2, and aligns activity to the majority. The reference alignment has a better sum-of-pairs score, overall information score and alignment complexity.

III-B2 Medical Explanation

The reference alignment attempts to align activities to those activities in cases 1 and 2. Cases 1 and 2 are similar because the patient arrived before the complete medical team could assemble. In both cases, someone other than the examining provider began the assessment, and upon arrival, the assigned examining provider started the exam over from the beginning, producing the repeated activity sequences shown in cases 1 and 2. In case 10, the full medical team had assembled before the patient’s arrival, and the examining provider was the only person to conduct the examination. Although this alignment produced a unique deviation of due to a repeated chest auscultation, a review of the data showed that the rest of the activities were more consistent with the consensus sequence derived from this algorithm. Overall, the alignment with 5 heuristic errors is clinically more similar to the cases to which it is aligned.

Our overall misalignment score evaluation matches the feedback from the medical team because the overall misalignment score considers the performance of all patterns in the alignment. Though the reference alignment has a better sum-of-pairs score, confidence and complexity, this comes at the price of misaligning the patterns and . In real-life situations, patterns are usually given priority since the alignment is context-based. Given the high correlation between the overall misalignment score and the number of heuristic errors, we can use the overall misalignment score as the principal alignment accuracy evaluation method.

Besides the 5 heuristic errors in the alignment, we also analyzed the activities considered to be deviations in both alignments. These activities are labeled 1 to 12 and marked with black boxes in Fig. 6(b).

For deviations , these activities were performed by different roles during the trauma resuscitation. Since the alignment algorithm does not consider role information and aligns activities based on activity type and order only, these activities are aligned in single columns with no other optimal places.

Activity was a deviation because the person performing it acted earlier than expected, putting the activity out of order; activities should generally be performed after Upper Extremity Pulses Assessment because Lower Extremity Pulses Assessment is checked at the patient’s feet and the examining provider typically proceeds from the head to the feet, yet occasionally Lower Extremity Pulses Assessment is performed first; activities were deviations because the patient was not cooperative and the medical team had to perform the pupil assessment multiple times; activities were minor deviations due to the back-and-forth assessments performed by the medical team; activity in case 13 was skipped at first and then performed by the medical team after the pupil assessment, and activity in case 16 was an unexpected repetition; activities were performed in later phases and thus not included in the primary survey phase, so cases 20, 32 and 33 skip the pupil assessment, and these missing pupil checks were considered deviations during the trauma resuscitation.

Good-quality alignments are able to identify real-life deviations. Though some deviations like in Fig. 6(b) are false alarms caused by role information that the algorithm does not consider, the overall evaluation methods still help to generate the alignment with a minimal number of heuristic errors.

IV Conclusion

In this paper, we analyzed the previous alignment evaluation methods and proposed modifications that make them suitable for evaluating overall alignment quality. We discussed the limitations of previous evaluation methods through experiments on synthetic process log data, showing that the column score is not appropriate for precise accuracy measurement. Experimental results showed that our overall misalignment score and overall information score perform better in evaluating alignment quality than the previous ones. We verified our methods on a real-life medical process, showing that our evaluation methods help identify deviations in the process. With the overall alignment quality evaluation methods, the alignment algorithm can be further optimized.


This research was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM011834.


  • [1] E. Rojas, J. Munoz-Gama, M. Sepúlveda, and D. Capurro, “Process mining in healthcare: A literature review,” Journal of biomedical informatics, vol. 61, pp. 224–236, 2016.
  • [2] A. Partington, M. Wynn, S. Suriadi, C. Ouyang, and J. Karnon, “Process mining for clinical processes: a comparative analysis of four australian hospitals,” ACM Transactions on Management Information Systems (TMIS), vol. 5, no. 4, p. 19, 2015.
  • [3] A. Rozinat, A. A. De Medeiros, C. W. Günther, A. Weijters, and W. M. Van der Aalst, “Towards an evaluation framework for process mining algorithms,” BPM Center Report BPM-07-06, BPMcenter. org, vol. 123, p. 142, 2007.
  • [4] R. J. C. Bose and W. van der Aalst, “Trace alignment in process mining: opportunities for process diagnostics,” in International Conference on Business Process Management.   Springer, 2010, pp. 227–242.
  • [5] L. Bouarfa and J. Dankelman, “Workflow mining and outlier detection from clinical activity logs,” Journal of biomedical informatics, vol. 45, no. 6, pp. 1185–1190, 2012.
  • [6] S. Yang, M. Zhou, R. Webman, J. Yang, A. Sarcevic, I. Marsic, and R. S. Burd, “Duration-aware alignment of process traces,” in Industrial Conference on Data Mining.   Springer, 2016, pp. 379–393.
  • [7] L. L. Wu, “Some comments on “sequence analysis and optimal matching methods in sociology: Review and prospect”,” Sociological methods & research, vol. 29, no. 1, pp. 41–64, 2000.
  • [8] A. Rozinat, A. K. A. de Medeiros, C. W. Günther, A. Weijters, and W. M. van der Aalst, “The need for a process mining evaluation framework in research and practice,” in International Conference on Business Process Management.   Springer, 2007, pp. 84–89.
  • [9] G. H. Gonnet, C. Korostensky, and S. Benner, “Evaluation measures of multiple sequence alignments,” Journal of Computational Biology, vol. 7, no. 1-2, pp. 261–276, 2000.
  • [10] J. C. Setubal and J. Meidanis, Introduction to computational molecular biology.   PWS Pub., 1997.
  • [11] J. D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple sequence alignment programs,” Nucleic acids research, vol. 27, no. 13, pp. 2682–2690, 1999.
  • [12] R. J. C. Bose and W. M. van der Aalst, “Process diagnostics using trace alignment: opportunities, issues, and challenges,” Information Systems, vol. 37, no. 2, pp. 117–141, 2012.
  • [13] R. C. Edgar and K. Sjölander, “A comparison of scoring functions for protein sequence profile alignment,” Bioinformatics, vol. 20, no. 8, pp. 1301–1308, 2004.
  • [14] J. De Weerdt, M. De Backer, J. Vanthienen, and B. Baesens, “A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs,” Information Systems, vol. 37, no. 7, pp. 654–676, 2012.
  • [15] J. D. Thompson, F. Plewniak, and O. Poch, “Balibase: a benchmark alignment database for the evaluation of multiple alignment programs.” Bioinformatics, vol. 15, no. 1, pp. 87–88, 1999.
  • [16] C. Sander and R. Schneider, “Database of homology-derived protein structures and the structural meaning of sequence alignment,” Proteins: Structure, Function, and Bioinformatics, vol. 9, no. 1, pp. 56–68, 1991.
  • [17] I. M. Wallace, O. O’Sullivan, D. G. Higgins, and C. Notredame, “M-coffee: combining multiple sequence alignment methods with t-coffee,” Nucleic acids research, vol. 34, no. 6, pp. 1692–1699, 2006.
  • [18] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of molecular biology, vol. 48, no. 3, pp. 443–453, 1970.
  • [19] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of molecular biology, vol. 147, no. 1, pp. 195–197, 1981.
  • [20] P. Hogeweg and B. Hesper, “The alignment of sets of sequences and the construction of phyletic trees: an integrated method,” Journal of molecular evolution, vol. 20, no. 2, pp. 175–186, 1984.
  • [21] J. M. Sauder, J. W. Arthur, and R. L. Dunbrack Jr, “Large-scale comparison of protein sequence alignment algorithms with structure alignments,” Proteins: Structure, Function, and Bioinformatics, vol. 40, no. 1, pp. 6–22, 2000.
  • [22] A. Burattin and A. Sperduti, “Plg: A framework for the generation of business process models and their execution logs.” in Business Process Management Workshops, vol. 66.   Springer, pp. 214–219.
  • [23] A. Adriansyah, J. Munoz-Gama, J. Carmona, B. F. van Dongen, and W. M. van der Aalst, “Alignment based precision checking,” in International Conference on Business Process Management.   Springer, 2012, pp. 137–149.