SUMMARIZED: Efficient Framework for Analyzing Multidimensional Process Traces under Edit-distance Constraint

05/02/2019 ∙ by Phuong Nguyen, et al. ∙ IBM ∙ University of Illinois at Urbana-Champaign

Domains such as scientific workflows and business processes exhibit data models with complex relationships between objects. This relationship is typically represented as sequences, where each data item is annotated with multi-dimensional attributes. There is a need to analyze this data for operational insights. For example, in business processes, users are interested in clustering process traces into smaller subsets to discover less complex process models. This requires expensive computation of similarity metrics between sequence-based data. Related work on dimension reduction and embedding methods does not take into account the multi-dimensional attributes of data, and does not address the interpretability of data in the embedding space (i.e., it favors vector-based representation). In this work, we introduce Summarized, a framework for efficient analysis on sequence-based multi-dimensional data using intuitive and user-controlled summarizations. We introduce summarization schemes that provide tunable trade-offs between the quality and efficiency of analysis tasks and derive an error model for summary-based similarity under an edit-distance constraint. Evaluations using real-world datasets show the effectiveness of our framework.


1 Introduction

Application domains such as business processes and scientific workflows exhibit data models in the form of multi-dimensional sequences of objects. For example, in business processes, given an underlying business process model represented as a directed acyclic graph of activities, the traces generated from the execution of the model are regarded as instances of the underlying model. Each trace consists of a sequence of activities sorted by time, where each activity in the trace appears in the process model and may be repeated. (In this paper, we use trace, process trace, and sequence interchangeably to refer to an instance of a process.) Figure 1 shows an example of a loan application process model. The highlighted activities in the figure represent a possible execution trace of the model. In addition to the sequential structure, each activity also contains multi-dimensional attributes. For example, an activity in the loan application process can contain information about the person who performs the activity, the group to which she belongs, and the department responsible for the activity. In another domain, provenance data captured from the execution of scientific workflows are also in the form of multi-dimensional sequences. Figure 2 shows a sample trace of a semiconductor manufacturing process, where each activity can carry additional information, such as the sector where the activity is performed and the person responsible for that activity.

Figure 1: Example 1: loan application process and a sample trace.
Figure 2: Example 2: sample trace of a scientific workflow.

With the popularity of such applications, there is an increasing need to analyze the data for operational insights, and there are many efforts to apply machine learning techniques in the business process management field [2, 6]. As business models mined from complete process traces are often complex and difficult to comprehend [7], users are interested in clustering process traces into smaller subsets and applying process discovery algorithms [1] on each subset. The models discovered using only the traces in a cluster tend to be both less complex and more accurate since there is less diversity among the traces within a cluster. In another example, scientists are interested in querying the provenance data of scientific workflow executions to look for previous executions of a workflow that are similar to the one in the query, again using models trained on historical executions.

Analyzing multi-dimensional sequence data poses a number of challenges. The first challenge is the computational complexity of data analysis. For example, edit-distance is often used to capture the similarity between sequences [3]. Since computing edit-distance takes time quadratic in the sequence length and each sequence can consist of hundreds of data items (e.g., in business processes), it is computationally expensive to compute the similarities between sequences. This is especially challenging when dealing with large datasets and in applications such as traces clustering, where many pairwise similarities need to be computed. This complexity can also cause long delays in interactive applications, such as similarity search, where users interact directly with the application and expect to get the results in a timely manner. The second challenge is to combine the multi-dimensional attributes of data with the sequential structure between data objects in a unified approach. Edit-distance, for example, is concerned only with counting the minimum number of basic operations required to transform one sequence of activities into the other, without considering the attributes of activities.
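To make the quadratic cost concrete, the following is a minimal sketch of the standard Wagner-Fischer dynamic program for edit-distance with unit costs. The function name `edit_distance` and the toy traces are illustrative assumptions, not part of the framework itself.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two sequences (Wagner-Fischer).

    Fills one row of the dynamic-programming table at a time, so the
    running time is O(len(a) * len(b)) and the memory is O(len(b)).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Toy traces of activity labels (hypothetical).
trace_1 = ["submit", "check", "approve"]
trace_2 = ["submit", "approve"]
distance = edit_distance(trace_1, trace_2)
```

With traces of several hundred activities, each pairwise comparison fills a table of tens of thousands of cells, which is why all-pairs tasks such as clustering become expensive.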

In this paper, we introduce Summarized, a framework for the efficient analysis of multi-dimensional sequence data under an edit-distance constraint. We focus on analysis tasks that are based on edit-distance similarity because it is a widely used measure for sequences. Instead of performing computationally expensive analysis on the original high-dimensional data, we transform the data into a summary space that has fewer dimensions, so that more efficient analysis can be applied. To incorporate multi-dimensional attributes of data items into the analysis, we introduce summarization schemes that allow users to select attributes as the summarization criteria. In addition to attribute-based summarizations, which produce summaries of good semantics but are limited in giving users control over the resolution of summaries (and thus, the efficiency), we also introduce topic-based summarization that enables a flexible trade-off between the quality and efficiency of analysis tasks on summaries. In addition, we develop an error model for the edit-distance measure in the summary space to provide theoretical guarantees for the results of analysis tasks on summaries.

2 Related Work

There has been active research on subsequence mapping and sequence retrieval, especially with biological sequence data [25]. To support efficient mapping and retrieval, one common approach is to summarize the original sequences using q-grams [4][19][15] and measure the similarity between two sets of q-grams. Another common approach is the reference-based method. For example, [24][30] filter the results of a query sequence using precomputed distances between sequences in the database and a set of reference sequences. DRESS [16] uses the most frequent codewords as references to identify a set of candidate matches for a query. These works, however, do not consider the multi-dimensional attributes of data items in the original sequences. In addition, neither q-gram-based nor reference-based methods preserve the sequential relationship between data items in the original sequences, and thus they do not support similarity measures under an edit-distance constraint.

Another area of related work is graph similarity and mining, of which sequences are a special case. Since graph edit-distance is also very computationally expensive, most of this work tries to transform the original graph into a more compact representation before measuring similarity. One common transformation approach is based on graph substructures, such as stars [33], trees [36], branches [20], paths [35], or k-shingles [22]. Recently, there has been an effort to solve graph edit-distance using binary linear programming [18]. While some of these works provide an error bound on graph edit-distance in the substructure space, they only consider homogeneous graphs. In addition, a major issue with this group of related work is that graphs lose their interpretability and graphical representation after being transformed into a substructure representation (i.e., graphs are represented either as bags of substructures [33][36] or as numeric vectors [22]).

In terms of intuitive and interpretable summarization of graphs [21][14], Zhao et al. [34] introduce the Graph Cube model, which supports OLAP queries effectively on large multidimensional networks. Such a model can be used to produce interpretable summaries of the original graph, in the form of aggregate graphs, by performing cuboid queries. Tian et al. [28] introduce two database-style operations to summarize multi-dimensional graphs: one produces a summary by grouping nodes based on attributes and relationships, while the other allows users to control the resolution of summaries. Chen et al. [5] show that random summaries can effectively reduce the size of the original graph while still supporting the mining of frequent graph patterns. In this paper, besides using explicit attributes, we also leverage implicit topics as summarization criteria. We also show that, unlike on general graphs, random summarization on sequences is effective but inefficient.

Embedding methods [13][10][31] have been used to improve the efficiency of similarity search on complex data (see [12] for an excellent survey). However, only a few embedding approaches can guarantee important properties of the similarity measure on the embedding space, such as the contractive property. For example, such a guarantee might require that the similarity measure on the embedding space come from a specific family of measures (e.g., the Minkowski metric [13]). Another major drawback of embedding techniques is that they transform the original sequences into a vector-based representation, and thus do not maintain the sequential relationship between data items in the new representation. In this paper, we formally define summarization schemes on sequences and show that the contractive property of the similarity measure on the summary space under an edit-distance constraint can be guaranteed.

There have been efforts to address the scalability issue in process mining and business process analysis [26]. However, these efforts mainly focus on process model discovery from large, complex traces [17][1][32]. There is also related work on using vector space-based dimension reduction to improve the performance of traces clustering [27][23]. In this paper, we focus on improving the efficiency of traces similarity search and traces clustering under an edit-distance constraint. There is also related work on performing process discovery on large-scale datasets using Map-Reduce [9]. Our work can be used in combination with these efforts. For example, once traces are clustered into smaller subsets, efficient process discovery algorithms [11] can be applied to each subset.

3 Framework

Figure 3 highlights the motivation for designing the Summarized framework. We assume the existence of an original dataset, which consists of a set of process traces or logs of scientific workflow executions. Running an analysis on it, which is typically computationally expensive due to the high dimensionality of the data, provides results that are deemed the exact or “ground truth” answer.

Figure 3: Overview of Summarized’s approach.

The key principle of our framework is to transform the original data into a new space with fewer dimensions, thus avoiding the computationally expensive analysis on the original dataset. The resulting output, which inherently differs from the “ground truth”, is known as an approximate result. To demonstrate the practicality of our framework, we need to address the following two challenges: (1) how to generate summaries of data in a controlled and intuitive manner, and (2) how to relate the approximate results on summaries to the results on the original data.

For the first challenge, many sequence- and graph-based summarization methods (sequences being a special form of graphs) generate summaries using statistics, patterns, or substructures of the data. Thus, the resulting summaries are often difficult for users to interpret, as they lack a structural semantic connection with the original representation. The lack of structural semantics also prevents analysis tasks that rely on structural information (such as edit distance-based analysis, whose results are easy for users to interpret) from being performed on the summary space. Finally, under existing methods, users do not have much control over the summaries that will be generated. As a result, it is difficult to integrate user expertise and feedback into the summarization process to guide the data analysis.

For the second challenge, since the analysis results on data summaries might not be the same as those on the original data, it is important to understand the relationship between the two and, for practical purposes, to provide guarantees about the quality of the results obtained from summaries.

To address the above-mentioned challenges, in the remaining sections, we define sequential-order-preserving summarization on sequences and introduce several summarization schemes that are intuitive and give users more control over the resulting summaries. We also formally present an error model for the summary-based similarity measure under an edit-distance constraint and show that it provides a quality guarantee for the results of the similarity search task.

4 Definitions

This section provides basic definitions of the notion of multidimensional sequence and summarization of sequences. We define a multidimensional set $\mathcal{O}$ as a set of objects and a set of associated attributes $A = \{a_1, \dots, a_m\}$: each object $o \in \mathcal{O}$ is defined as a tuple $o = (o[a_1], \dots, o[a_m])$, in which the $i$-th dimension corresponds to the value of attribute $a_i$ of $o$, denoted as $o[a_i]$.

A Multidimensional Sequence $S$ of size $n$ on a multidimensional set $\mathcal{O}$ is defined as an ordered set of objects in $\mathcal{O}$: $S = \langle s_1, s_2, \dots, s_n \rangle$. We denote $idx_S(o)$ as the index, or position, of an object $o$ in a sequence $S$. In the above definition, $idx_S(s_i) = i$. For example, Figure 2 presents a sequence of objects defined on a multidimensional set with three attributes: Activity, Sector, and Responsible.

Our interest is in different forms of summarization of multidimensional sequences to improve the efficiency of sequence analysis. Before defining summarization of sequences, we define the notion of many-to-one mapping of objects between multidimensional sets as an object mapping function $\phi$ from an original multidimensional set $\mathcal{O}$ to a summary set $\mathcal{O}'$, $\phi: \mathcal{O} \to \mathcal{O}'$, so that $\phi(o) \in \mathcal{O}'$ for each $o \in \mathcal{O}$.

Next, we define summarization of sequences based on a many-to-one mapping $\phi$, called $\phi$-summarization:

Definition 1

A $\phi$-summarization of a sequence $S$ on $\mathcal{O}$ is defined as a summary sequence on $\mathcal{O}'$, denoted as $\phi(S)$, where each object $s_i$ is replaced by its many-to-one mapping $s'_i = \phi(s_i)$, while retaining the same index $i$.

A summarization of a sequence is said to preserve the sequential relationship from the original sequence if it satisfies the following definition:

Definition 2

A $\phi$-summarization of a sequence $S$, denoted as $\phi(S)$, is a sequential preserving summarization of $S$ if: $\forall s_i, s_j \in S$, if $idx_S(s_i) < idx_S(s_j)$, then $idx_{\phi(S)}(s'_i) \le idx_{\phi(S)}(s'_j)$, with $s'_i = \phi(s_i)$ and $s'_j = \phi(s_j)$.

By retaining the indices of objects in the original sequence, $\phi$-summarization (cf. Definition 1) preserves sequential relationships. However, it does not shorten the sequence, and shorter sequences are vital to improving the efficiency of sequence analysis. Therefore, we define the notion of reduced $\phi$-summarization, in which adjacent duplicate objects in the summary sequence are collapsed to reduce the size of a summarized sequence.

Definition 3

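A reduced $\phi$-summarization of a sequence $S$ on $\mathcal{O}$ is defined as a sequence on $\mathcal{O}'$, denoted as $\phi_r(S)$, where each object $s_i$ is replaced by its $\phi$-based mapping in $\mathcal{O}'$ and, $\forall s_i, s_{i+1} \in S$, if $\phi(s_i) = \phi(s_{i+1})$, then $\phi(s_i)$ and $\phi(s_{i+1})$ share the same index in $\phi_r(S)$ (i.e., they are collapsed into a single object).

Theorem 4.1

A reduced $\phi$-summarization is sequence preserving.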

Proof

Given an original sequence on , let us denote as a sequence on and is the reduced -summarization of . Elements in can also be described as follow: and , for (i.e., is the mapping of the first non-duplicate element since ).

Let us consider and , . There are three possibilities:

  • and : In this case, we have and , .

  • and : In this case, we have and . As a result, .

  • : Since , we have according to the above definition of .

In all of the above cases, , and thus, preserves the sequential relationship between elements in .
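The two forms of summarization defined in this section can be sketched in a few lines. The snippet below is a minimal illustration that assumes the many-to-one mapping is given as a plain dictionary; the activity and phase names are hypothetical.

```python
from itertools import groupby


def summarize(sequence, mapping):
    """Non-reduced summarization: replace each object by its image under a
    many-to-one mapping, keeping every position of the original sequence."""
    return [mapping[obj] for obj in sequence]


def reduced_summarize(sequence, mapping):
    """Reduced summarization: map each object, then collapse every run of
    adjacent duplicates into a single item."""
    return [key for key, _ in groupby(summarize(sequence, mapping))]


# Hypothetical activities mapped onto two coarser phases.
mapping = {"submit": "intake", "check": "intake",
           "approve": "decision", "notify": "decision"}
trace = ["submit", "check", "check", "approve", "check", "notify"]

summary = summarize(trace, mapping)          # same length as the trace
reduced = reduced_summarize(trace, mapping)  # adjacent duplicates collapsed
```

Because collapsing only merges neighbours, the relative order of the surviving items is the order in which their sources appear in the original trace, which is the intuition behind Theorem 4.1.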

5 Summarization

In this section, we formally present our proposed summarization schemes. (Unless explicitly stated otherwise, in the remaining sections, a summarization always refers to a reduced summarization.)

5.1 Attribute-based Summarization

To incorporate the multidimensional attributes of a sequence’s data items, we first define the notion of attributes compatible mapping that leverages a data item’s attributes as a summarization criteria:

Definition 4

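Given a multidimensional set $\mathcal{O}$ and a set of attributes $A' \subseteq A$, a mapping $\phi$ is defined as an $A'$-compatible mapping if: $\forall o_1, o_2 \in \mathcal{O}$, $\phi(o_1) = \phi(o_2)$ if and only if $o_1[a] = o_2[a]$ for all $a \in A'$.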

Next, we define attribute-based summarization based on the definition of attributes compatible mapping:

Definition 5

Given a multidimensional set $\mathcal{O}$ and a set of attributes $A' \subseteq A$, an $A'$-based summarization is defined as a reduced $\phi$-summarization where the mapping $\phi$ is an $A'$-compatible mapping on $\mathcal{O}$.

(a) Activity-based
(b) Sector-based
(c) Responsible-based
Figure 4: Different forms of attribute-based summarization of the trace in Example 2.

Attribute-compatible summarization provides an intuitive way for users to choose attributes as summarization criteria and produces summaries that are easy to interpret. However, it does not give users control over the average length of summarized sequences, which we refer to as resolution. This is because attribute values are static and already defined with the original data. Figure 4 shows examples of different attribute-based summarizations of the trace in Example 2: Activity-based (Figure 4(a)), Sector-based (Figure 4(b)), and Responsible-based (Figure 4(c)). The Activity-based summary has the largest resolution among the examples, while the Responsible-based summary has the smallest resolution (i.e., it is the most compact summary).
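As an illustration of how the choice of attribute changes the resolution, here is a minimal sketch in which each object is a dictionary of attribute values and the mapping simply projects an object onto the selected attributes. The function name `attribute_summary` and the attribute values are hypothetical and are not taken from the datasets used later.

```python
from itertools import groupby


def attribute_summary(trace, attributes):
    """Attribute-based summarization: two objects map to the same summary
    object exactly when they agree on every selected attribute; adjacent
    duplicates are then collapsed (reduced summarization)."""
    projected = [tuple(obj[a] for a in attributes) for obj in trace]
    return [key for key, _ in groupby(projected)]


# Hypothetical trace with the three attributes of Example 2.
trace = [
    {"Activity": "Clean",   "Sector": "Wet",       "Responsible": "Ann"},
    {"Activity": "Etch",    "Sector": "Wet",       "Responsible": "Ann"},
    {"Activity": "Inspect", "Sector": "Metrology", "Responsible": "Ann"},
    {"Activity": "Etch",    "Sector": "Wet",       "Responsible": "Bob"},
]

by_activity = attribute_summary(trace, ["Activity"])        # longest summary
by_sector = attribute_summary(trace, ["Sector"])
by_responsible = attribute_summary(trace, ["Responsible"])  # most compact
```

The resolution is fixed by the data: the only way to obtain a shorter or longer summary here is to pick a different attribute, which is the limitation discussed above.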

Since longer summarized sequences reduce the efficiency of sequence data analysis, and attribute-based summarization offers users little flexibility in controlling that efficiency, it would be desirable if users are empowered to make the trade-off between efficiency and accuracy of data analysis, especially when dealing with large data or data of high complexity. For example, in a sequence similarity search application, users might decide to tolerate a certain level of false positives in the results (e.g., 0.9 false positive rate) to trade-off for faster response (e.g., results are returned within 5 seconds). To address this issue, we introduce a novel summarization scheme that offers more flexibility and better control over the resolution of summaries, while still capturing semantic and sequential relationships of the original data as with attribute-based summarization.

Figure 5: Topic-based representation of the process in Example 1.

5.2 Topic-based Summarization

We are motivated by the observation that business processes can often be represented by higher-level process models of fewer dimensions. Figure 5 shows an example of a more abstract process model of the one in Figure 1. Each activity in Figure 5 corresponds to multiple activities in Figure 1. We propose a topic-based summarization technique that captures the many-to-one mapping from the original sequences to a topic-based summarization of fewer dimensions, where each topic is an abstract representation of a set of original dimensions. Since the topics are implicit in the original representation of sequences, we first perform dimensionality reduction on the original sequences to transform the original dimensions into topics. Then, we define the notion of topic-based summarization using the new representation.

Before applying dimension reduction techniques to the original sequences, it is important to have an appropriate data representation for sequences. We begin by selecting an attribute of the original sequences and transforming the multidimensional sequences into the corresponding attribute-based summarization. It is often intuitive to pick the attribute with the largest number of dimensions, as this attribute likely captures the most essential information about the objects in the original multidimensional set. For example, in Example 2, Activity is the attribute with the largest number of dimensions and it is also the base attribute used to represent sequences, while other attributes, such as Sector and Responsible, provide supporting information for Activity.

We then represent each sequence $S$ as a numeric vector $v_S = (w_1, \dots, w_d)$, where $B$ is the base attribute set that sequences are transformed to in the first step and $d$ is the number of dimensions on $B$. We measure $w_i$ for $S$ in a way that captures both the local importance of each dimension and its specificity to a sequence. To capture the local importance, we use the frequency of the $i$-th dimension in $S$, denoted as $tf_i(S)$, which is defined as the number of items in $S$ whose values equal the $i$-th dimension of $B$, denoted as $b_i$. To capture the specificity, we use the popularity of a dimension across all sequences: $df_i = |\{S \in D : tf_i(S) > 0\}|$, where $D$ is the set of all sequences. Intuitively, the higher $df_i$ is, the more popular the $i$-th dimension is and thus, the less specific it is to a sequence. The formulation of $w_i$ is as follows:

(1)
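One common way to combine a local frequency with an inverse popularity term is a TF-IDF-style weight. The sketch below assumes that form (frequency times the logarithm of the number of sequences over the number of sequences containing the dimension); this exact formula, the function name `sequence_vectors`, and the toy traces are assumptions made for illustration and may differ from Equation (1).

```python
import math
from collections import Counter


def sequence_vectors(sequences):
    """Return (dims, vectors): one weight per distinct dimension for each
    sequence, computed as tf * log(N / df) (an assumed TF-IDF-style form)."""
    dims = sorted({item for seq in sequences for item in seq})
    n = len(sequences)
    # df[d]: number of sequences in which dimension d occurs at least once
    df = Counter(item for seq in sequences for item in set(seq))
    vectors = []
    for seq in sequences:
        tf = Counter(seq)  # local frequency of each dimension in this sequence
        vectors.append([tf[d] * math.log(n / df[d]) for d in dims])
    return dims, vectors


# Hypothetical activity traces.
seqs = [["A", "B", "B"], ["A", "C"], ["A", "B"]]
dims, vecs = sequence_vectors(seqs)
```

Under this weighting, a dimension that occurs in every sequence (here "A") gets weight zero everywhere, while a dimension confined to one sequence (here "C") is weighted most heavily.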

After representing sequences as vectors, the set of sequences $D$ can be represented as a matrix $X$ of size $|D| \times d$, where each row corresponds to the vector representation of a sequence in $D$. With this matrix representation, we can apply off-the-shelf dimension reduction techniques on $X$, such as non-negative matrix factorization (NMF), principal component analysis (PCA), or singular value decomposition (SVD), among others. The results of these techniques can be presented as two matrices $W$ and $H$. $W$, whose size equals $|D| \times k$ with $k$ being the number of new dimensions (i.e., $k < d$), represents the original sequences on the summary space. $H$, whose size equals $d \times k$, represents the original dimensions on the new dimensions, or topics (i.e., each row is a vector representing the distribution of an original dimension over the set of new dimensions).

Based on the results of dimensionality reduction, we now need to produce a many-to-one mapping from the original dimensions to topics. Two dimensions $b_i$ and $b_j$ in the original space are likely to be in the same topic if their corresponding vectors in $H$ have high similarity (e.g., using cosine similarity). In addition, $b_i$ and $b_j$ are likely to be in the same topic if they frequently appear next to each other in a sequence (i.e., they represent two closely related activities in the underlying process model). From these insights, we model the problem of finding an optimal many-to-one mapping from the original dimensions to topics as a constrained optimization problem:

(2)
subject to

where is the similarity between dimensions and based on their corresponding representation in , is the number of times and appear next to each other in input sequence set , and is used to bias towards similarity between dimensions or the number of adjacent appearances.

We now can formally define the notion of topic summarization as follows:

Definition 6

($k$-Topic Summarization) A $k$-topic summarization of sequences from an original multidimensional set $\mathcal{O}$ to a summary set $\mathcal{O}'$ is defined as a reduced $\phi$-summarization, where the mapping $\phi$ is the solution of the optimization problem defined in (2).

Finding an efficient $k$-topic summarization is the crux of the problem. We say “efficient” as opposed to optimal because our $k$-topic summarization problem is NP-hard (a variant of the set partitioning problem). Thus, we resort to a greedy heuristic approach. Our approach is similar to the agglomerative clustering algorithm. It starts by treating each original dimension as a singleton cluster and then successively merges the pairs of clusters that are closest to each other until all clusters have been merged into a single cluster that contains all dimensions. This step creates a hierarchy where each leaf node is a dimension and the root is the single cluster of the last merge. Because we want a partition into $k$ disjoint clusters as the new dimensions, the next step is to cut the hierarchy at some point to obtain the desired number of clusters. To find the cut, we use a simple approach based on finding the minimum threshold such that the distance between any two dimensions in the same cluster is no more than that threshold and there are at most $k$ clusters.

Figure 6 outlines the process to generate a $k$-topic summarization of sequences. There are two steps that require input from users: the number of topics $k$ (i.e., dimensions) on the summary space, and semantic labels for the discovered topics. These inputs can be used by users to control the resolution of the summary space, as well as to integrate user expertise into the summarization (and thus, into the analysis tasks).
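The greedy procedure can be sketched as follows. This is a minimal illustration under several stated assumptions: the pairwise score is a weighted sum of cosine similarity and a normalized adjacency count (the weight `alpha` and the normalization are illustrative choices, not the exact objective of (2)), clusters are compared with complete linkage, and merging stops as soon as `k` clusters remain, which corresponds to cutting the full hierarchy at the level with `k` clusters. All names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def adjacency_counts(sequences):
    """Count how often two distinct dimensions appear next to each other."""
    counts = {}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a != b:
                key = frozenset((a, b))
                counts[key] = counts.get(key, 0) + 1
    return counts


def greedy_topics(topic_vectors, sequences, k, alpha=0.5):
    """Merge dimensions bottom-up until k clusters remain.

    topic_vectors maps each original dimension to its vector over topics.
    Returns a many-to-one mapping: dimension -> cluster id.
    """
    adj = adjacency_counts(sequences)
    max_adj = max(adj.values(), default=1)

    def score(a, b):
        # Assumed combination: alpha biases towards vector similarity,
        # (1 - alpha) towards how often a and b are adjacent.
        return (alpha * cosine(topic_vectors[a], topic_vectors[b])
                + (1 - alpha) * adj.get(frozenset((a, b)), 0) / max_adj)

    def linkage(pair):
        # Complete linkage: a cluster pair is as similar as its least
        # similar pair of members.
        i, j = pair
        return min(score(a, b) for a in clusters[i] for b in clusters[j])

    clusters = [[d] for d in sorted(topic_vectors)]
    while len(clusters) > max(k, 1):
        i, j = max(combinations(range(len(clusters)), 2), key=linkage)
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return {d: cid for cid, members in enumerate(clusters) for d in members}


# Hypothetical input: four activities described over two topics.
vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [0.1, 0.9]}
traces = [["A", "B", "C", "D"], ["A", "B", "D", "C"]]
topic_of = greedy_topics(vectors, traces, k=2)
```

The returned dictionary can be used directly as the many-to-one mapping of a reduced summarization; a smaller `k` yields shorter summaries at the cost of merging less similar activities.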

Figure 6: Topic summarization procedure.

6 Error Model for Edit-Distance on Summaries

We seek to answer the question of how to relate approximate results of analysis tasks on the summary space to those on the original space. Since the similarity measure is an important operator in many analysis tasks, such as similarity search and traces clustering, we focus on the relationship between the similarity of sequences on the summary space and that on the original space under an edit-distance constraint: $ed(S_1, S_2)$ and $ed(\phi(S_1), \phi(S_2))$, where $ed$ is the edit-distance function and $\phi$ is a summarization function. We select edit-distance as the similarity measure because it captures both the structural similarity (i.e., whether two sequences consist of data items in similar order) and content-based similarity (i.e., whether two sequences share a similar set of data items) between sequences. Furthermore, edit-distance results, presented as a chain of edit operations that transform one sequence into the other, can be easily interpreted by users, which makes it widely popular in practice.

In terms of the relationship between $ed(S_1, S_2)$ and $ed(\phi(S_1), \phi(S_2))$, we are interested in the contractive and proximity preservation properties.

Definition 7

Given a summarization $\phi$, we say that the edit-distance measure satisfies the contractive property on $\phi$ if $\forall S_1, S_2$: $ed(\phi(S_1), \phi(S_2)) \le ed(S_1, S_2)$.

The contractive property is particularly important for applications such as similarity search, because it guarantees that performing edit-distance based similarity search on the summary space using $\phi$ will yield results with 100% recall [12][24]. Specifically, given a query sequence $Q$ and an edit-distance threshold $\tau$, the similarity search task needs to find all sequences in the sequence set $D$ whose edit-distance to $Q$ is smaller than or equal to $\tau$: $R = \{S \in D : ed(Q, S) \le \tau\}$. If the contractive property holds for a summarization $\phi$, we can avoid expensive calculation of edit-distance on the original space by finding all sequences that satisfy the threshold on the summary space: $R' = \{S \in D : ed(\phi(Q), \phi(S)) \le \tau\}$. Because $ed(Q, S) \le \tau$ implies $ed(\phi(Q), \phi(S)) \le \tau$, we can guarantee that if $S \in R$, then $S \in R'$ (i.e., 100% recall).
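The search strategy described above amounts to a filter step on the summary space followed by an optional verification step on the original space. The following minimal sketch uses a non-reduced summarization, for which the contractive property is shown to hold; the helper names, the mapping, and the toy database are illustrative assumptions.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (Wagner-Fischer), one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_search(query, database, mapping, tau):
    """Filter on the summary space, then verify on the original space.

    Returns (candidates, matches): candidates satisfy the threshold on the
    summaries; matches are the candidates that also satisfy it originally.
    """
    def summarize(seq):
        return [mapping[x] for x in seq]  # non-reduced summarization

    q = summarize(query)
    candidates = [s for s in database if edit_distance(q, summarize(s)) <= tau]
    matches = [s for s in candidates if edit_distance(query, s) <= tau]
    return candidates, matches


# Hypothetical mapping of four activities onto two summary dimensions.
mapping = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
database = [["a", "c", "d"], ["b", "c", "d"], ["b", "d", "c"], ["c", "a", "b"]]
candidates, matches = similarity_search(["a", "c", "d"], database, mapping, tau=1)
```

In this toy run the filter keeps three of the four sequences and the verification step discards one of them as a false positive; no true match is lost by the filter, which is the 100% recall guarantee. In practice the benefit comes from the summary-space comparisons being cheaper, for example when the summaries are reduced.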

Definition 8

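Given a summarization $\phi$, we say that the edit-distance measure satisfies the proximity preservation property on $\phi$ if, $\forall S_1, S_2, S_3$: if $ed(S_1, S_2) \le ed(S_1, S_3)$, then $ed(\phi(S_1), \phi(S_2)) \le ed(\phi(S_1), \phi(S_3))$.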

The proximity preservation property is particularly important for applications such as traces clustering that group similar traces into the same cluster. This is because the proximity preservation property guarantees that traces that are similar in the original space are also similar in the summary space. Thus, the clustering results on the summary space will likely be similar to those on the original space.

While the contractive property does not hold in general for edit-distance between summarized sequences, we show that it holds under certain circumstances. The first such circumstance is when the summarization is a non-reduced many-to-one summarization:

Theorem 6.1

If $\phi$ is a non-reduced many-to-one summarization on $\mathcal{O}$, as defined in Definition 1, then we have: $ed(\phi(S_1), \phi(S_2)) \le ed(S_1, S_2)$, $\forall S_1, S_2$ on $\mathcal{O}$.

Proof

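Let us assume that $S_1 = \langle a_1, \dots, a_n \rangle$, $S_2 = \langle b_1, \dots, b_m \rangle$. For compact representation, we denote $\phi(a_i)$ as $a'_i$ and $\phi(b_j)$ as $b'_j$.

As part of the recursive Wagner-Fischer algorithm to calculate edit-distance between two sequences $S_1$ and $S_2$, let us consider the step that involves comparing two data items $a_i$ and $b_j$ ($1 \le i \le n$, $1 \le j \le m$). If we denote the edit-distance at the current step as $d_{i,j}$ and $d'_{i,j}$ (for edit-distance on summary space), based on the recursive formula of the Wagner-Fischer algorithm, we have:

$$d_{i,j} = \begin{cases} d_{i-1,j-1} & \text{if } a_i = b_j \\ \min(d_{i-1,j},\ d_{i,j-1},\ d_{i-1,j-1}) + 1 & \text{if } a_i \neq b_j \end{cases}$$

and the same recurrence holds for $d'_{i,j}$ with $a'_i$ and $b'_j$ in place of $a_i$ and $b_j$.

If $a_i = b_j$, then we have $d_{i,j} = d_{i-1,j-1}$. Because of the many-to-one summarization $\phi$, $a'_i = b'_j$. Hence, $d'_{i,j} = d'_{i-1,j-1}$. So, both $d_{i,j}$ and $d'_{i,j}$ do not require any edit cost in this case.

If $a_i \neq b_j$, then we have $d_{i,j} = \min(d_{i-1,j},\ d_{i,j-1},\ d_{i-1,j-1}) + 1$. Because of the many-to-one summarization $\phi$, we have either $a'_i \neq b'_j$ or $a'_i = b'_j$. Thus, $d'_{i,j} = \min(d'_{i-1,j},\ d'_{i,j-1},\ d'_{i-1,j-1}) + 1$ if $a'_i \neq b'_j$ (i.e., one edit cost), or $d'_{i,j} = d'_{i-1,j-1}$ if $a'_i = b'_j$ (i.e., no edit cost). So, in this case, $d_{i,j}$ requires one edit cost, while $d'_{i,j}$ requires either one or zero edit cost.

Therefore, we always have $d'_{i,j} \le d_{i,j}$. Since the values $d_{i,j}$ and $d'_{i,j}$ form the matrices used by the recursive algorithm to calculate $ed(S_1, S_2)$ and $ed(\phi(S_1), \phi(S_2))$ respectively, we have $ed(\phi(S_1), \phi(S_2)) \le ed(S_1, S_2)$.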

Consider the case when $\phi$ is a reduced many-to-one summarization. We are able to derive rules to indicate whether the contractive property holds or does not hold for the edit-distance of a particular pair of sequences using summarization $\phi$:

Theorem 6.2

Given two sequences in the original space , if is a reduced many-to-one summarization on , as defined in Definition 3, then:

  • If , then we have ; or edit-distance on summary space by satisfies the contractive property.

  • If , then we have ; or edit-distance on summary space by does not satisfy the contractive property.

where and , with being the length of .

Proof

This theorem can be easily proven by noticing that and in fact define the upper bound and lower bound on the edit-distance of a pair of sequences.

The first rule is proven by using the chain rule of inequality:

.

Similarly, for the second rule: .

While Theorem 6.2 does not cover all cases, we show empirically that the number of sequence pairs whose edit-distances on reduced summarization violate the contractive property is very small. Thus, it has a high recall for similarity search task when using reduced many-to-one summarization.

For the proximity preservation property, we are able to show in our evaluation that the edit distance-based traces clustering results in the summary space have comparable accuracy, compared with those in the original space, while having better efficiency. This implies that the proximity relationship is well-preserved in the summary space under edit-distance constraint.

7 Evaluation

We demonstrate the utility of our Summarized framework by evaluating its effectiveness and efficiency on two analysis tasks: trace similarity search and traces clustering.

Datasets: We use three datasets from different domains: a dataset (596 traces with 1066 types of activities, each activity having multi-dimensional attributes) that contains traces generated from the executions of a semiconductor manufacturing process, a 2015 dataset (1199 traces with 289 activity types) that contains process traces of building permit applications, and a dataset (2000 traces with 113 activity types) that consists of synthetically generated logs that represent a large bank transaction process. (The dataset is provided by IBM and is private. The other datasets are available at https://data.4tu.nl/repository/collection:all.) We run our evaluation on a computer with 16GB of RAM and a 2.7GHz quad-core Intel Core i7 CPU.

Summarization schemes: We compare the results of analysis tasks in the summary space using our proposed summarization schemes (i.e., Topic and Attribute) and Random summarization, which randomly maps an original dimension to a new dimension in the summary space, with the analysis results on the original space. Although Random-based summaries lack interpretability, as shown in [5], a random summarization scheme on sequence graphs can yield good results. We vary the number of dimensions $k$ in the summary space used by Topic and Random and vary the attributes used by Attribute.

Scheme    k=2      k=5      k=10     k=20     k=50     k=100
Topic     0.000%   0.003%   0.006%   0.007%   0.010%   0.014%
Random    0.002%   0.010%   0.007%   0.021%   0.027%   0.033%
Figure 7: Similarity false negatives: percentage of sequence pairs in the dataset where edit-distance in the summary space violates the contractive property. There are over 177,000 total sequence pairs in the dataset.
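A count of this kind could be produced along the following lines. This is a sketch under stated assumptions: edit distance is taken to be unit-cost Levenshtein distance, a violation is a pair whose summary-space distance exceeds its original-space distance, and `summarize` is a hypothetical callable standing in for any summarization scheme.

```python
from itertools import combinations
from typing import Callable, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def contractive_violation_rate(traces: Sequence[Sequence],
                               summarize: Callable[[Sequence], Sequence]) -> float:
    """Fraction of trace pairs whose summary distance exceeds the original distance."""
    pairs = list(combinations(traces, 2))
    if not pairs:
        return 0.0
    violations = sum(
        1 for s, t in pairs
        if edit_distance(summarize(s), summarize(t)) > edit_distance(s, t)
    )
    return violations / len(pairs)
```

For reference, 596 traces yield 596 × 595 / 2 = 177,310 unordered pairs, which is consistent with the "over 177,000" pairs reported in the caption.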

Evaluation metrics for the similarity search task: The contractive property holds in most cases, as seen in Figure 7, which shows the percentage of sequence pairs in the dataset (out of over 177,000 pairs) whose edit-distances violate the contractive property in the summary space under Topic and Random summarization for different numbers of summary dimensions k. Since the recall rate is high, we focus on the false positive rate of the similarity search results. Given an edit-distance threshold, this metric reports, out of all sequence pairs that satisfy the threshold in the summary space, the fraction that do not actually satisfy it in the original space.
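As a rough illustration of this metric, the sketch below computes a false positive rate from precomputed distances. The function name, the input layout (one `(original_distance, summary_distance)` tuple per sequence pair), and the use of "less than or equal to the threshold" as the matching condition are assumptions, since the exact condition is not recoverable from the text.

```python
from typing import Iterable, Optional, Tuple


def false_positive_rate(distance_pairs: Iterable[Tuple[int, int]],
                        threshold: int) -> Optional[float]:
    """Share of pairs reported as similar in the summary space that are not
    similar in the original space.

    Returns None when no pair satisfies the threshold in the summary space.
    """
    reported = 0         # pairs within the threshold in the summary space
    false_positives = 0  # reported pairs that exceed the threshold originally
    for original_distance, summary_distance in distance_pairs:
        if summary_distance <= threshold:
            reported += 1
            if original_distance > threshold:
                false_positives += 1
    return false_positives / reported if reported else None


if __name__ == "__main__":
    pairs = [(1, 1), (5, 2), (7, 6), (3, 3)]
    print(false_positive_rate(pairs, threshold=3))
```

In the example, three pairs fall within the threshold in the summary space and one of them exceeds it in the original space, so one third of the reported matches are false positives.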

Effectiveness of summarization schemes on similarity search: Figure 8 shows the effectiveness of different summarization schemes on the similarity search task for the three datasets. [Footnote 4: We evaluate Attribute summarization on only one dataset because that dataset's attributes provide better semantics than those in the other two.] The y-axis reports the false positive rate, while the x-axis corresponds to different edit-distance thresholds. As expected (Figure 8(a), 8(b), 8(d), 8(e), 8(f), 8(g)), the higher the number of dimensions in the summary space (denoted by k), the better the result (i.e., the lower the false positive rate). That is because, with more dimensions in the summary space, summaries of sequences more closely resemble the original sequences. Thus, there is little difference between edit-distances in the summary space and in the original space.

Figure 8: False positive rates by different summarization schemes on the similarity search task using the three datasets. Panels: (a) Random, (b) Topic, (c) Attribute, (d) Random, (e) Topic, (f) Random, (g) Topic.

When comparing the results of different summarization schemes at the same number of dimensions, Random outperforms Topic summarization (at the cost of the interpretability of the summaries and of efficiency, as we show later). For Attribute (Figure 8(c)), since we do not control the number of dimensions (it depends on the attribute data), the quality of the results also depends on the chosen attribute. [Footnote 5: Three main attributes of an activity are used on the data: the person in charge of the activity, the area/department where the activity takes place, and the tool used to perform the activity.] Specifically, the person attribute outperforms the area and tool attributes. This is in part because there are more dimensions in the person attribute's summary space, and thus the summaries in that space more closely resemble the original sequences. The area and tool attributes produce similar results, since similar tools are often used in the same area.

Efficiency of summarization schemes on similarity calculation: To evaluate the efficiency of different summarization schemes, we vary the number of dimensions in the summary space and measure the time it takes to calculate the edit-distance similarity between all pairs of sequences. Figure 9 highlights the results. For both Topic and Random summarizations, the higher k is, the longer it takes to calculate the edit-distances. [Footnote 6: Again, since we cannot control the number of dimensions of Attribute, we do not include it in this evaluation. However, Attribute produces efficiency results similar to those of summarization schemes with a similar number of dimensions.] This is expected because a higher k results in longer sequences in the summary space, and thus it is more expensive to calculate the edit-distances. For similar values of k, Topic outperforms Random, which verifies Topic's ability to capture the semantic relationship between the original dimensions and thus significantly reduce the size of sequences in the summary space, as well as the processing time. More importantly, even at different values of k where we observed similar effectiveness for Topic and Random (Figure 8), Topic is still much more efficient than Random.

Figure 9: Efficiency comparison of processing time between Topic and Random summarizations using the three datasets (panels (a), (b), and (c), one per dataset).

Evaluation metrics for the traces clustering task: We evaluate the clustering results using process-specific metrics [3][8]: weighted average conformance fitness, and weighted average structure complexity. While the process model's conformance fitness quantifies the extent to which the discovered model can accurately reproduce the recorded traces, the structure complexity quantifies whether the clustering results produce process models that are simple and compact. Given a summarization scheme, we first transform all sequences to the summary space, and then perform traces clustering (using hierarchical clustering) with edit-distance as the similarity measure. Then, a process model is generated for each cluster using the Heuristic mining algorithm [32] and then converted to the Petri-net model for conformance analysis. Given the Petri-net model, we use two publicly available plugins from the ProM framework [29] for fitness and structural complexity analysis: the Conformance Checker Plugin is used to measure the fitness of the generated process models, and the Petri-Net Complexity Analysis Plugin is used to analyze the structural complexity of the process models. After fitness and complexity scores are calculated for each cluster, the final scores are calculated as the average score over all clusters, weighted by the cluster size.
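The final aggregation step, an average over clusters weighted by cluster size, can be sketched as follows. The function name and argument layout are illustrative assumptions; the per-cluster scores themselves would come from the ProM plugins and are not computed here.

```python
from typing import Sequence


def weighted_average_score(scores: Sequence[float], cluster_sizes: Sequence[int]) -> float:
    """Average of per-cluster scores, weighted by the number of traces per cluster."""
    if len(scores) != len(cluster_sizes):
        raise ValueError("scores and cluster_sizes must have the same length")
    total = sum(cluster_sizes)
    if total <= 0:
        raise ValueError("total cluster size must be positive")
    return sum(score * size for score, size in zip(scores, cluster_sizes)) / total


if __name__ == "__main__":
    # Hypothetical fitness scores for two clusters of 30 and 10 traces.
    print(weighted_average_score([0.9, 0.5], [30, 10]))
```

Weighting by cluster size keeps a few tiny clusters with extreme scores from dominating the overall fitness or complexity figure.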

Effectiveness of summarization schemes on traces clustering: Figure 10 highlights the conformance fitness of the clustering results in the summary space by different summarization schemes on the dataset. [Footnote 7: We use different values of k for Topic and for Random, chosen so that the two configurations share similar effectiveness in the similarity search task.] Surprisingly, using summarization schemes not only helps improve the efficiency of the clustering task (as we showed earlier in the efficiency evaluation), but also helps produce clusters with process models of higher fitness, compared with the clustering results in the original space. The trend is similar when varying the number of clusters. That is because measuring trace similarity in the summary space helps remove noise that often exists when measuring similarity using the original representation. Among the summarization schemes, Attribute helps produce clustering results of higher conformance fitness (especially for one of the attributes; see Figure 10). That is because Attribute summarizations better capture the semantic relationship between traces (e.g., traces are similar if the corresponding sequences of persons, areas, or tools are similar).

Figure 10: Conformance fitness comparison.
Figure 11: Traces clustering results’ structural complexity comparison. (Green and red boxes denote best and worst results, respectively.)

In terms of structural complexity (Figure 11), the Attribute summarizations outperform the other summarization schemes and the results in the original space. This is again due to Attribute's ability to capture semantic relationships between traces, which helps produce clusters whose process models capture actual groups of traces that share similar semantics (and thus have a simple model structure). On the other hand, Random is the worst performer, because random summarization cannot capture the semantic relationship between traces.

In both the conformance fitness and structural complexity tests, Topic summarization produces results that approach those of Attribute. Unlike Attribute summarization, which does not give users control over the resolution of the summaries, Topic summarization provides a qualitative advantage by offering a tunable parameter, k, to trade off between effectiveness and efficiency in the analysis task.

8 Conclusions and Future Work

In this work, we introduce Summarized, a framework to perform efficient analysis on sequence-based multi-dimensional data using intuitive and user-controlled summarizations. We define a set of summarization schemes that offer flexible trade-off between quality and efficiency of analysis tasks and derive an error model for summary-based similarity under an edit-distance constraint. Evaluation results on real-world datasets show the effectiveness and efficiency of Summarized. For future work, we plan to apply our framework to other application domains, and design our framework to run on distributed infrastructure (e.g., Map-Reduce, Spark).

References

  • [1] Van der Aalst, W., et al.: Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering (2004)
  • [2] van der Aalst, W.M.P.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer Publishing Company, Incorporated, 1st edn. (2011)
  • [3] Bose, R.J.C., van der Aalst, W.M.: Context aware trace clustering: Towards improving process mining results. In: Proceedings of the SDM. SIAM (2009)
  • [4] Burkhardt, S., et al.: q-gram based database searching using a suffix array (quasar). In: Proceedings of the third annual international conference on Computational molecular biology. ACM (1999)
  • [5] Chen, C., et al.: Mining graph patterns efficiently via randomized summaries. Proceedings of the VLDB Endowment (2009)
  • [6] De Masellis, R., Di Francescomarino, C., Ghidini, C., Montali, M., Tessaris, S.: Add data into business process verification: Bridging the gap between theory and practice. In: AAAI. pp. 1091–1099 (2017)
  • [7] De Medeiros, A.K.A., et al.: Process mining based on clustering: A quest for precision. In: International Conference on Business Process Management. Springer (2007)
  • [8] De Weerdt, J., et al.: Active trace clustering for improved process discovery. IEEE Transactions on Knowledge and Data Engineering (2013)
  • [9] Evermann, J.: Scalable process discovery using map-reduce. IEEE Transactions on Services Computing (2016)
  • [10] Faloutsos, C., Lin, K.I.: FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. ACM (1995)
  • [11] Grigori, D., Casati, F., Dayal, U., Shan, M.C.: Improving business process quality through exception understanding, prediction, and prevention. In: VLDB. vol. 1 (2001)
  • [12] Hjaltason, G.R., Samet, H.: Properties of embedding methods for similarity searching in metric spaces. IEEE Transactions on Pattern Analysis and machine intelligence (2003)
  • [13] Hristescu, G., et al.: Cluster-preserving embedding of proteins. Tech. rep., Technical Report 99-50, Computer Science Department, Rutgers University (1999)
  • [14] Khan, A., Bhowmick, S.S., Bonchi, F.: Summarizing static and dynamic big graphs. Proceedings of the VLDB Endowment 10(12), 1981–1984 (2017)
  • [15] Kim, J., Li, C., Xie, X.: Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings. In: Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. pp. 169–180. IEEE (2016)
  • [16] Kotsifakos, A., et al.: Dress: dimensionality reduction for efficient sequence search. Data Mining and Knowledge Discovery (2015)
  • [17] Leemans, S.J., et al.: Scalable process discovery with guarantees. In: International Conference on Enterprise, Business-Process and Information Systems Modeling. Springer (2015)
  • [18] Lerouge, J., Abu-Aisheh, Z., Raveaux, R., Héroux, P., Adam, S.: New binary linear programming formulation to compute the graph edit distance. Pattern Recognition 72, 254–265 (2017)
  • [19] Li, C., et al.: Vgram: Improving performance of approximate queries on string collections using variable-length grams. In: Proceedings of the 33rd VLDB (2007)
  • [20] Li, Z., Jian, X., Lian, X., Chen, L.: An efficient probabilistic approach for graph similarity search. arXiv preprint arXiv:1706.05476 (2017)
  • [21] Liu, Y., Safavi, T., Dighe, A., Koutra, D.: Graph summarization methods and applications: A survey. ACM Computing Surveys (CSUR) 51(3), 62 (2018)
  • [22] Manzoor, E.A., et al.: Fast memory-efficient anomaly detection in streaming heterogeneous graphs. arXiv preprint arXiv:1602.04844 (2016)
  • [23] Nguyen, P., Slominski, A., Muthusamy, V., Ishakian, V., Nahrstedt, K.: Process trace clustering: A heterogeneous information network approach. In: Proceedings of the 2016 SIAM International Conference on Data Mining (2016)
  • [24] Papapetrou, P., et al.: Reference-based alignment in large sequence databases. Proceedings of the VLDB Endowment (2009)
  • [25] Roy, A., et al.: Massive genomic data processing and deep analysis. Proceedings of the VLDB Endowment 5(12), 1906–1909 (2012)
  • [26] Sayal, M., Casati, F., Dayal, U., Shan, M.C.: Business process cockpit. In: Proceedings of the 28th international conference on Very Large Data Bases. pp. 880–883 (2002)
  • [27] Song, M., Yang, H., Siadat, S.H., Pechenizkiy, M.: A comparative study of dimensionality reduction techniques to enhance trace clustering performances. Expert Systems with Applications 40(9), 3722–3737 (2013)
  • [28] Tian, Y., et al.: Efficient aggregation for graph summarization. In: Proceedings of the 2008 ACM SIGMOD. ACM (2008)
  • [29] Van Dongen, B.F., et al.: The prom framework: A new era in process mining tool support. In: International Conference on Application and Theory of Petri Nets. Springer (2005)
  • [30] Venkateswaran, J., et al.: Reference-based indexing of sequence databases. In: Proceedings of the 32nd international conference on Very large data bases (2006)
  • [31] Wang, X., et al.: An index structure for data mining and clustering. Knowledge and information systems (2000)
  • [32] Weijters, A., et al.: Process mining with the heuristics miner-algorithm
  • [33] Zeng, Z., et al.: Comparing stars: on approximating graph edit distance. Proceedings of the VLDB Endowment (2009)
  • [34] Zhao, P., et al.: Graph cube: on warehousing and olap multidimensional networks. In: Proceedings of the 2011 ACM SIGMOD. ACM (2011)
  • [35] Zhao, X., et al.: Efficient graph similarity joins with edit distance constraints. In: 2012 IEEE 28th International Conference on Data Engineering. IEEE (2012)
  • [36] Zheng, W., et al.: Graph similarity search with edit distance constraint in large graph databases. In: Proceedings of the 22nd CIKM. ACM (2013)