Enabling process mining directly on an operational database or data warehouse presents new opportunities. It provides additional flexibility, because event logs can be constructed on demand by writing an SQL query, even if they are distributed over multiple tables, as is the case, for example, in SAP. It even provides opportunities for fully flexible querying, allowing for the formulation of practically any process mining question. Moreover, process mining directly on a database leverages the proven technology of databases for efficiently handling large data collections in real time, which is one of the challenges identified in the process mining manifesto. This can speed up process mining, especially when extremely large logs are considered, such as the call behavior of clients of a telecom provider, or the driving behavior of cars on a road network.
To illustrate the potential benefits of process mining on a database, Table 1 shows a very simple event log as it could be stored in a database relation. In practice, such a log would contain thousands of events, as is the case for the well-known BPI Challenge logs, and even millions of events in the examples of the telecom provider and the road network mentioned above. Table 1 shows the activities that were performed in an organization, the (customer) case for which these activities were performed, and the start and end time of the activities. Given such a relation, it is important in process mining to be able to retrieve the ‘directly follows’ relation, because this relation is the basis for many process mining techniques, including the alpha algorithm (and its variants), the heuristic miner, and the fuzzy miner. The directly follows relation consists of the pairs of events that follow each other directly in some case. The SQL query that retrieves this relation from the relation in Table 1 is:
SELECT DISTINCT a.activity, b.activity
FROM R a, R b
WHERE a.case = b.case AND a.end_time < b.end_time
  AND NOT EXISTS (SELECT * FROM R c
                  WHERE c.case = a.case
                    AND a.end_time < c.end_time
                    AND c.end_time < b.end_time);
Another example is the query that returns the average waiting time before each activity:
SELECT b.activity, AVG(b.start_time - a.end_time)
FROM R a, R b
WHERE a.case = b.case AND a.end_time < b.end_time
  AND NOT EXISTS (SELECT * FROM R c
                  WHERE c.case = a.case
                    AND a.end_time < c.end_time
                    AND c.end_time < b.end_time)
GROUP BY b.activity;
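To make the two queries above concrete, the following sketch runs them on a tiny in-memory SQLite log. The relation layout and the column names are assumptions (in particular, `case_id` is used instead of `case`, which is a reserved word in SQL); the data is illustrative, not the paper's actual Table 1.

```python
import sqlite3

# A minimal, assumed stand-in for the event log relation R of Table 1.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE R (
    case_id TEXT, activity TEXT, start_time INTEGER, end_time INTEGER)""")
conn.executemany("INSERT INTO R VALUES (?, ?, ?, ?)", [
    ("c1", "register", 1, 2),
    ("c1", "check",    3, 4),
    ("c1", "decide",   5, 6),
    ("c2", "register", 2, 3),
    ("c2", "decide",   4, 7),
])

# The 'directly follows' query from the text.
df = conn.execute("""
    SELECT DISTINCT a.activity, b.activity
    FROM R a, R b
    WHERE a.case_id = b.case_id AND a.end_time < b.end_time
      AND NOT EXISTS (SELECT * FROM R c
                      WHERE c.case_id = a.case_id
                        AND a.end_time < c.end_time
                        AND c.end_time < b.end_time)""").fetchall()
print(sorted(df))
# [('check', 'decide'), ('register', 'check'), ('register', 'decide')]

# The average waiting time before each activity.
wait = conn.execute("""
    SELECT b.activity, AVG(b.start_time - a.end_time)
    FROM R a, R b
    WHERE a.case_id = b.case_id AND a.end_time < b.end_time
      AND NOT EXISTS (SELECT * FROM R c
                      WHERE c.case_id = a.case_id
                        AND a.end_time < c.end_time
                        AND c.end_time < b.end_time)
    GROUP BY b.activity""").fetchall()
print(dict(wait))
```

Note that `register` directly precedes `decide` in case c2 but not in case c1, where the intermediate `check` event is filtered out by the NOT EXISTS subquery.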
These queries illustrate the challenges that arise when doing process mining directly on a database:
The queries are inconvenient. Even a conceptually simple process mining request like ‘retrieve all directly follows relations between events’ is difficult to phrase in SQL.
The queries are inefficient. The reason is that the ‘directly follows’ relation that is at the heart of process mining requires a nested query (i.e. the ‘NOT EXISTS’ part), and nested queries are known to cause performance problems in databases, irrespective of the quality of query optimization [18, 10]. This will be discussed in detail in Section 4.
Consequently, measures must be taken to make process mining - and in particular extracting the ‘directly follows’ relation - feasible on relational databases. Figure 1 shows three possible strategies. Figure 1.i shows the current state of the art, in which a user constructs an SQL query to extract a log from the database. This log is written to disk (for example as a csv file) and then read again by a process mining tool for further processing. Consequently, the complete log must be read or written three times and there is some manual work involved in getting the log to and from disk. It is easy to imagine a process mining tool that does not need intermediate storage to disk. Such a tool would only need to read the log once and would not require manual intervention to get the log to and from disk. Figure 1.ii illustrates this strategy.
This paper proposes a third strategy, in which the DBMS supports a native ‘directly follows’ operator. This strategy has the benefit that it does not require intermediate storage on disk and that it facilitates flexible and convenient querying for process mining related information. In addition, it has the benefit that it leverages proven database technology, which can efficiently handle very large data sets in real time. To realize this strategy, the paper defines the ‘directly follows’ operator in relational algebraic form, by defining what it does, how it behaves with respect to other operators, and what its execution costs are. By doing so this paper formally defines the necessary relational algebraic elements of a ‘directly follows’ operator, which are required for implementation of such an operator in a DBMS.
Against this background, the remainder of this paper is structured as follows. Section 2 explains relational algebra as background for this paper. Section 3 presents a relational algebraic ‘directly follows’ operator. Section 4 shows the computational cost of executing this operator and the potential effects of query optimization with respect to this operator. Finally, Section 5 presents related work and Section 6 the conclusions.
2 Background
Relational algebra is at the core of every relational database system. It is used to define the execution semantics of an SQL query and to define equivalence rules to determine which execution semantics (among a set of equivalent ones) is the most efficient to execute. Before we define a relational algebraic ‘directly follows’ operator, we provide background on these two topics.
2.1 Relational Algebra
In this section we briefly define the basic relational algebraic operators. We refer to the many textbooks on databases for more detailed definitions.
Definition 1 (Attribute, Schema, Relation)
An attribute is a combination of a name and a domain of possible values. A schema is a set of attributes. Any two elements of a schema have different names. A relation is a combination of a schema and a set of tuples. Each tuple in a relation maps attribute names from the schema of the relation to values from the corresponding domain.
For example, let C be the domain of case identifiers, A be the domain of activities, and T be the time domain. The relation of Table 1 then has the schema {case: C, activity: A, start_time: T, end_time: T} and one tuple for each row of the table.
In the remainder of this paper, we will also write t ∈ R to denote that t is in the set of tuples of a relation R. For a relation R with a schema that defines an attribute with name a, we will use t.a to refer to the value of attribute a in tuple t.
Definition 2 (Relational Algebra)
Let R, R1, R2 be relations with schemas S, S1, S2. Furthermore, let a, b, a1, …, an be attribute names, and c a condition over attributes that is constructed using conjunction (∧), disjunction (∨), negation (¬), and binary conditions (=, ≠, <, ≤, >, ≥) over attributes and attribute values. We define the usual relational algebra operators:
Selection: σ_c(R) = {t ∈ R | c(t)}, where c(t) is derived from c by replacing each attribute name a by its value t.a in tuple t. The schema of σ_c(R) is S.
Projection: π_{a1,…,an}(R) = {t restricted to a1, …, an | t ∈ R}. The schema of π_{a1,…,an}(R) is S restricted to the attributes with names a1, …, an.
Renaming: ρ_{a→b}(R) returns R with attribute a renamed to b. The schema of ρ_{a→b}(R) is derived from S by replacing the name of attribute a by b. In the remainder of this paper, we will also use ρ_x(R) to represent prefixing all attributes of R with ‘x.’.
In addition, we define the usual set theoretic operators ∪, ∩, and −. These operators have the usual set-theoretic interpretation, but they require that R1 and R2 have the same schema. We define the Cartesian product of R1 with schema S1 and R2 with schema S2 as R1 × R2 = {t1 ∪ t2 | t1 ∈ R1, t2 ∈ R2}. The schema of R1 × R2 is S1 ∪ S2.
Finally, a join operator is usually defined as a shorthand for the commonly used operation of joining tuples from two relations that have some property in common. The join operator is a combination of Cartesian product and selection: R1 ⋈_c R2 = σ_c(R1 × R2).
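As an illustration of Definition 2, these operators can be sketched in a few lines of Python, with a relation represented as a set of (attribute name, value) tuples. The representation and function names are illustrative assumptions, not part of the formal definitions.

```python
# A small sketch of the operators of Definition 2. A relation is a set of
# tuples; each tuple is a sorted, hashable sequence of (name, value) pairs.

def rel(*dicts):
    return {tuple(sorted(d.items())) for d in dicts}

def select(R, cond):                      # selection: keep tuples where cond holds
    return {t for t in R if cond(dict(t))}

def project(R, names):                    # projection: restrict to given names
    return {tuple((k, v) for k, v in t if k in names) for t in R}

def rename(R, prefix):                    # renaming: prefix all attribute names
    return {tuple(sorted((prefix + "." + k, v) for k, v in t)) for t in R}

def product(R1, R2):                      # Cartesian product
    return {tuple(sorted(t1 + t2)) for t1 in R1 for t2 in R2}

def join(R1, R2, cond):                   # join = selection over the product
    return select(product(R1, R2), cond)

# Example: combine each event with its matching case tuple.
events = rel({"case": 1, "activity": "a"}, {"case": 2, "activity": "b"})
cases = rel({"case": 1, "x": 9}, {"case": 2, "x": 8})
joined = join(rename(events, "l"), rename(cases, "r"),
              lambda t: t["l.case"] == t["r.case"])
```

Renaming with distinct prefixes before the join mirrors the `R a, R b` aliases in the SQL queries of Section 1.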
2.2 Query Optimization
There are a large number of proven relational algebraic equivalences that can be used to rewrite relational algebraic expressions [5, 21]. In the remainder of this paper, we use the following ones. Let R, R1, R2 be tables, a1, …, an be attributes, v be an attribute value, c, c1, c2 be conditions, and θ be a binary condition (=, ≠, <, ≤, >, ≥). Then, in particular: selections can be split and reordered (σ_{c1∧c2}(R) = σ_{c1}(σ_{c2}(R))); a selection distributes over the Cartesian product when each of its conjuncts refers to attributes of only one operand (σ_{c1∧c2}(R1 × R2) = σ_{c1}(R1) × σ_{c2}(R2)); and a selection commutes with a projection when the condition only refers to projected attributes (σ_c(π_{a1,…,an}(R)) = π_{a1,…,an}(σ_c(R))).
In practice these equivalences are used to generate alternative formulas that lead to the same result, but represent alternative execution strategies. For example, σ_{c1}(R1) × σ_{c2}(R2) can be proven to be equivalent to σ_{c1∧c2}(R1 × R2). However, the former represents the execution strategy in which we first execute the selections and then the Cartesian product, while the latter represents the execution strategy in which we first execute the Cartesian product and then the selection. The first execution strategy is much more efficient than the second, because it only requires the Cartesian product to be computed for a (small) subset of R1 and R2.
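The cost difference between the two strategies can be made concrete with a small sketch; the relation sizes and selection conditions are illustrative assumptions.

```python
# Both strategies yield the same result, but selection push-down does far
# less work: 9 combinations instead of 1,000,000.
R1 = [{"id": i} for i in range(1000)]
R2 = [{"val": j} for j in range(1000)]

# Strategy 1: select first, then take the Cartesian product (3 x 3 pairs).
small = [dict(**t1, **t2)
         for t1 in R1 if t1["id"] < 3
         for t2 in R2 if t2["val"] < 3]

# Strategy 2: Cartesian product first, then select (1,000,000 pairs built).
big = [dict(**t1, **t2)
       for t1 in R1 for t2 in R2
       if t1["id"] < 3 and t2["val"] < 3]

assert small == big  # equivalent results, very different amounts of work
```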
3 Relational Algebra for Process Mining
This section defines the ‘directly follows’ relation as a relational algebraic operator. It also presents and proves equivalences for this operator that can be used for query optimization, similar to the equivalences that are presented in Section 2.2.
3.1 Directly Follows Operator
The directly follows operator retrieves events that directly follow each other in some case. For a database table R that has a column c, which denotes the case identifier, and a column t, which denotes the completion timestamp of an event, we denote this operator as df_{c,t}(R). For example, applying the operator to the example log from Table 1 (i.e. df_{case,end_time}(R)) returns Table 3. Similar to the way in which the join operator is defined in terms of other relational algebra operators, we define the ‘directly follows’ operator in terms of the traditional relational algebra operators as follows.
Definition 3 (Directly Follows Operator)
df_{c,t}(R) = σ_{a.c = b.c ∧ a.t < b.t}(ρ_a(R) × ρ_b(R)) − π_{AB}(σ_{a.c = x.c ∧ a.c = b.c ∧ a.t < x.t ∧ x.t < b.t}(ρ_a(R) × ρ_b(R) × ρ_x(R)))

where AB is the set of attributes that are in ρ_a(R) or ρ_b(R).
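A direct transliteration of this composite definition (all same-case pairs ordered by completion time, minus the pairs that have an intermediate event) can be sketched as follows; the event representation and attribute names are assumptions.

```python
# Sketch of Definition 3 over a list of events: build all same-case pairs
# ordered by end time, then subtract the pairs with an intermediate event.
def directly_follows(events, case="case", time="end_time"):
    pairs = {(i, j)
             for i, a in enumerate(events) for j, b in enumerate(events)
             if a[case] == b[case] and a[time] < b[time]}
    indirect = {(i, j)
                for (i, j) in pairs
                for x in events
                if x[case] == events[i][case]
                and events[i][time] < x[time] < events[j][time]}
    return [(events[i], events[j]) for (i, j) in sorted(pairs - indirect)]

log = [
    {"case": "c1", "activity": "register", "end_time": 2},
    {"case": "c1", "activity": "check",    "end_time": 4},
    {"case": "c1", "activity": "decide",   "end_time": 6},
]
print([(a["activity"], b["activity"]) for a, b in directly_follows(log)])
# [('register', 'check'), ('check', 'decide')]
```

The pair (register, decide) is removed by the subtraction, because check lies between the two events, mirroring the NOT EXISTS subquery of Section 1.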
The directly follows operator can be used both in process mining algorithms that are based on it (or on the ‘footprints’ that are derived from it) and for flexible querying. Some example queries include:
- the two activities that precede a rejection;
- the activities in which the amount of a loan is changed;
- the resources that ever changed the amount of a loan.
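As an illustration of such flexible querying, one possible reading of the first example, phrased in SQL over the same relation as before, could look as follows. The schema, the activity name 'reject', and the exact form of the query are assumptions for illustration only.

```python
import sqlite3

# Hypothetical sketch: activities that directly precede a rejection.
# All table, column, and activity names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (case_id TEXT, activity TEXT, end_time INTEGER)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", [
    ("c1", "register", 1), ("c1", "check", 2), ("c1", "reject", 3),
    ("c2", "register", 1), ("c2", "accept", 2),
])
preceding = conn.execute("""
    SELECT DISTINCT a.activity
    FROM R a, R b
    WHERE a.case_id = b.case_id AND a.end_time < b.end_time
      AND b.activity = 'reject'
      AND NOT EXISTS (SELECT * FROM R c
                      WHERE c.case_id = a.case_id
                        AND a.end_time < c.end_time
                        AND c.end_time < b.end_time)""").fetchall()
print(preceding)  # [('check',)]
```

With a native df_{c,t} operator, the directly-follows pairs in this query would not have to be spelled out with a NOT EXISTS subquery.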
3.2 Directly Follows Query Optimization
To facilitate query optimization for the directly follows operator, we must define how it behaves with respect to the other operators and prove that behavior. We present this behavior as propositions along with their proofs. In each of these propositions, we use a, b, c, t as attributes (where, as a convention, we use c to denote the case identifier attribute and t to denote the time attribute), θ as a binary operator from the set {=, ≠, <, ≤, >, ≥}, and v as a value.
The first proposition holds for case attributes and event attributes. We define case attributes as attributes that keep the same value for all events in a case, from the moment that they get a value. We define event attributes as attributes that have a value for at most one event in each case. Consequently, we can only use this proposition for optimizing queries that involve a selection on a case or event attribute. Selections on other types of attributes (including resource attributes) cannot be optimized with this proposition.
Proposition 17 (directly follows and selection commute)
df_{c,t}(σ_{a θ v}(R)) = σ_{a.a θ v ∧ b.a θ v}(df_{c,t}(R)), if a is a case or event attribute.
Note that the proof relies on the equivalence which states that selection and projection commute, which requires that the condition only contains attributes that are also projected (in this case, a must be in AB). This requirement is satisfied as per Definition 3. Also note that the proof uses an assumption, which states that if any two events in a case have the same value for an attribute, all events for that case that are between these two (in time) must also have that value. This assumption holds for case attributes and for event attributes, which are the scope of this proposition.
The next proposition is a variant of the previous one, in which there is a condition on two attributes instead of an attribute and a value.
Proposition 18 (directly follows and selection commute 2)
df_{c,t}(σ_{a θ b}(R)) = σ_{a.a θ a.b ∧ b.a θ b.b}(df_{c,t}(R)), if a and b are case or event attributes.
The proof is analogous to the proof of proposition 17.
To prove that directly follows and projection commute, we first need to prove that projection and set minus commute, because the set minus operator is an important part of the definition of the directly follows operator. However, in the general case it is not true that projection and set minus commute. A counter example is easily constructed. Let R1 = {(a: 1, b: 1)} and R2 = {(a: 1, b: 2)}. For these relations it does not hold that π_a(R1 − R2) = π_a(R1) − π_a(R2): the left-hand side yields {(a: 1)}, while the right-hand side yields the empty set. However, we can prove this proposition for the special case that R2 is a subset of R1 and the projected attributes uniquely identify tuples in R1. Since these conditions are satisfied for the directly follows operator, it is sufficient to prove the proposition under these conditions.
Proposition 19 (projection and restricted set minus commute)
π_A(R1 − R2) = π_A(R1) − π_A(R2), if R2 ⊆ R1 and A uniquely identifies each tuple in R1.
This equivalence is proven by observing that R2 ⊆ R1 implies that a (possibly non-surjective) injective function f exists that matches each tuple in R2 to a unique tuple in R1. The fact that A uniquely identifies tuples in R1 (and also in R2, because R2 is a subset of R1) implies that f is completely determined by the values of the attributes in A, i.e., the values of attributes other than those in A have no consequence for f. Therefore, projecting R1 and R2 onto A does not change the tuple mapping defined by f.
Now, looking at the left side of proposition 19, calculating the projection over the difference means removing the attributes not in A from the tuples in R1 that are not in the range of f. Looking at the right side, calculating the difference over the projections means removing the attributes not in A from R1 and R2 (which does not affect f) and then selecting the tuples in R1 that are not in the range of f. These two are equivalent.
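The counter example and the restricted case can be checked mechanically; the set-of-tuples representation below is an illustrative assumption.

```python
# Relations as sets of (attribute, value) tuples.
R1 = {(("a", 1), ("b", 1))}
R2 = {(("a", 1), ("b", 2))}

def project(R, names):
    return {tuple((k, v) for k, v in t if k in names) for t in R}

# In general, projection and set minus do NOT commute:
left = project(R1 - R2, {"a"})                   # {(('a', 1),)}
right = project(R1, {"a"}) - project(R2, {"a"})  # empty set
assert left != right

# They do commute when R2 is a subset of R1 and the projected attributes
# uniquely identify each tuple (here, 'a' acts as a key):
R3 = {(("a", 1), ("b", 1)), (("a", 2), ("b", 2))}
R4 = {(("a", 2), ("b", 2))}
assert project(R3 - R4, {"a"}) == project(R3, {"a"}) - project(R4, {"a"})
```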
Proposition 20 (directly follows and restricted projection commute)
π_{a.A ∪ b.A}(df_{c,t}(R)) = df_{c,t}(π_A(R)), if {c, t} ⊆ A.
The next proposition, which states that the directly follows relation and the theta join commute, makes it explicit that the directly follows relation duplicates all attributes of a log event. Table 3 illustrates this. However, if the case, activity and start time attributes uniquely identify an event, then there is no need to duplicate the end time attribute or any other attribute. Nonetheless, the directly follows operator adds all attributes both on the a. side and on the b. side of the table. This redundancy can easily be fixed later on with a projection operator, and in future work additional efficiency may be achieved by avoiding this redundancy altogether.
Proposition 21 (directly follows and theta join commute)
df_{c,t}(R1 ⋈_θ R2) = (df_{c,t}(R1) ⋈_{θa} ρ_a(R2)) ⋈_{θb} ρ_b(R2), if each tuple from R1 is combined with exactly one tuple in R2, where θa (respectively θb) denotes θ applied to the a. (respectively b.) prefixed attributes.
4 Execution cost
We determine the computational cost of executing the directly follows operation, either as part of a process mining tool or as an operation that is executed directly on the database. We also determine the effect of query optimization on the directly follows operator.
4.1 Cost of computing the directly follows relation
The execution cost of a database operation is typically defined in terms of the number of disk blocks read or written, because reading from or writing to disk are the most expensive database operations. In line with the strategies for process mining on a database that are presented in Figure 1, Table 4 shows four execution strategies with their costs. (Note that the ‘with database operator’ strategy from Figure 1 is split up into two alternatives.) The cost is presented as an order of magnitude, measured in terms of the number of disk blocks that must be read or written. The number of disk blocks is linear with the number of events in the log and depends on the number of bytes needed to store an event and the number of bytes per disk block. These measures assume that the complete log fits into memory.
| execution strategy | order of costs (disk blocks) |
|---|---|
| classical process mining | O(n/b) |
| with intermediate storage | O(3·n/b) |
| with database connection | O(n/b) |
| with database operator | O(n/b) |
| composite operator | up to O(n³/(c²·b)) |

(with n the number of events, c the number of cases, and b the number of events per disk block)
Process mining with intermediate storage requires that the log is read and written three times: once to query the database for the log, once to store the log to disk, and once to load the log into the process mining tool. Consequently, the cost is O(3·n/b) disk blocks. Process mining directly on a database requires that the log is read only once, at a cost of O(n/b) disk blocks. Subsequent processing can be done in memory.
In many usage scenarios more flexible querying capabilities are needed, which can benefit from access to all SQL operators. For such usage scenarios, the ‘directly follows’ relation must be extracted directly from the database. It is easy to imagine how a native ‘directly follows’ operator would work. Such an operator would read the log, then sort it by case identifier and timestamp, and finally return each pair of subsequent rows that have the same case identifier. Such an operator would have to read the log from disk only once and consequently has linear cost. For databases that do not have a native ‘directly follows’ operator, the operator can be emulated using the composite formula from definition 3. The drawback of this formula is that it requires that the intermediate results from both sides of the minus operator are stored, after which the minus operator can be applied. While this is no problem as long as the intermediate results fit into memory, the costs become prohibitive once the intermediate results must be written to disk.
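Such a sort-then-scan operator can be sketched as follows. The sketch assumes that completion timestamps are unique within a case; the event representation and names are assumptions.

```python
# Sketch of a native 'directly follows' operator: sort by (case, end time),
# then emit each pair of adjacent events with the same case identifier.
# After the O(n log n) sort, every event is touched exactly once.
def native_directly_follows(events, case="case", time="end_time"):
    ordered = sorted(events, key=lambda e: (e[case], e[time]))
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if a[case] == b[case]]

log = [
    {"case": "c2", "activity": "register", "end_time": 2},
    {"case": "c1", "activity": "register", "end_time": 1},
    {"case": "c1", "activity": "decide",   "end_time": 3},
]
print([(a["activity"], b["activity"])
       for a, b in native_directly_follows(log)])
# [('register', 'decide')]
```

Because the scan only pairs adjacent rows, no intermediate pair set has to be materialized, in contrast with the composite formula of Definition 3.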
On a practical level, this problem manifested itself, for example, for the log of the BPI 2011 challenge on our desktop with an i5 processor, 8GB of internal memory and an SSD drive, using MySQL and the standard MySQL buffer size. Each attempt to perform a database query that involved a composite ‘directly follows’ relation needed at least 10 minutes to execute, which is prohibitive for interactive exploration of an event log.
On a theoretical level, the problem is illustrated in Figure 2. This figure shows that the problem arises when the number of events in the log is high, relative to the number of cases. The mechanism that causes this is easy to derive from definition 3, which shows that the intermediate results that must be stored are the pairs of events that directly or indirectly follow each other in some case (the left side and right side of the minus operator). Consequently, if there are many events per case, this number is high (cubic over the number of events per case in the right-hand side of the minus operator).
The precise calculation can be performed as follows. Let c be the number of cases in the log, n be the number of events, b be the block size (i.e. the number of tuples/events that fit in a single disk block), B_R = n/b be the number of disk blocks required to store the log, and M be the total memory size in blocks. Note that the cost of a block nested join (or minus) operator on two relations R1 and R2 that take B_1 and B_2 disk blocks (with B_1 ≤ B_2), is equal to B_1 + B_2 when one of the two relations fits in memory, and equal to B_1 + B_1·B_2/M otherwise. The cost is split up into five components:
The cost of the first join is denoted as C_J1. This equals B_R if the log fits into memory (the log needs to be read only once, because both operands are copies of the same relation) and B_R + B_R²/M otherwise. Note that this join appears twice, but that it only needs to be computed once.
The cost of storing the results of the first join to disk is denoted as C_W1. This equals 0 if the result fits in memory. Otherwise, the number of tuples in the result, which we denote as N_1, equals the number of pairs of events that directly or indirectly follow each other in some case: N_1 = c·(n/c)·((n/c) − 1)/2 on average. This fits into 2·N_1/b disk blocks (times 2 because each tuple in the result is a pair of tuples from the original).
The cost of the second join is denoted as C_J2. This equals 0 if the original log fits into memory. Otherwise, the cost equals B_R + (2·N_1/b)·B_R/M.
The cost of storing the result of the second join to disk is denoted as C_W2. This equals 0 if the result fits into memory. Otherwise, the number of tuples in the result, which we denote as N_2, equals the number of pairs of events that indirectly follow each other. This equals the number of pairs of events that directly or indirectly follow each other minus the number of pairs of events that directly follow each other: N_2 = N_1 − (n − c) on average. This fits into 2·N_2/b disk blocks (times 2 because each tuple in the result is a triple of tuples from the original that is reduced to a pair by projection).
The cost of the minus operator is denoted as C_M. This equals 0 if the result of the second join fits into memory. Otherwise, it equals 2·N_2/b + (2·N_2/b)·(2·N_1/b)/M.
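Under the cost formulas reconstructed above, the five components can be combined into a small calculator. The exact constants are assumptions that follow this section, not measured values; the sketch only illustrates how the cost stays linear while the intermediate results fit in memory and explodes once they do not.

```python
# Sketch of the five cost components, in disk blocks, using n events,
# c cases, b tuples per block and M memory blocks (all assumptions).
def df_cost(n, c, b, M):
    B_R = n / b                              # blocks to store the log
    N1 = c * (n / c) * (n / c - 1) / 2       # direct-or-indirect pairs
    N2 = N1 - (n - c)                        # indirect pairs only
    B1, B2 = 2 * N1 / b, 2 * N2 / b          # blocks for both join results
    c_j1 = B_R if B_R <= M else B_R + B_R * B_R / M
    c_w1 = 0 if B1 <= M else B1
    c_j2 = 0 if B_R <= M else B_R + B1 * B_R / M
    c_w2 = 0 if B2 <= M else B2
    c_minus = 0 if B2 <= M else B2 + B1 * B2 / M
    return c_j1 + c_w1 + c_j2 + c_w2 + c_minus

# With many cases, everything fits in memory and the cost is one log read;
# with few cases (many events per case), the intermediate pair sets spill
# to disk and dominate the cost.
print(df_cost(n=10**6, c=10**5, b=50, M=10**6))  # 20000.0, i.e. n/b
print(df_cost(n=10**6, c=10, b=50, M=10**6))     # orders of magnitude more
```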
To generate Figure 2 we used a tuple size of 80 bytes, a 4 GB buffer size, and a block size of 50, such that there is a total memory size of 1 million blocks. The figure shows two ‘thresholds’ in the computational cost. These thresholds are crossed when a particular intermediate result no longer fits into memory.
The order of the cost can be determined more easily. The order of the cost is determined by the cost of the set minus, because this incorporates both intermediate results, which are typically much larger than the original log. The order of the cost of computing each of the intermediate results is O(n²/(c·b)). The total order of cost is then obtained by filling these costs out in the right-hand side of the formula for computing the cost of the set minus, which yields: O(n²/(c·b) + (n²/(c·b))²/M). If we let M be large enough to contain the log itself, but not the intermediate results (i.e. we set M = n/b), this can be simplified to: O(n³/(c²·b)).
Summarizing, the execution cost of flexibly retrieving a directly follows relation directly from a database can be as low as retrieving it from a process mining tool, if the database supports a native ‘directly follows’ operator and the process mining tool supports on-database process mining. However, as long as a native ‘directly follows’ operator does not exist, the execution costs increase to third order polynomial cost if the average number of events per case is high (i.e. if intermediate results do not fit into memory anymore).
4.2 The effect of query optimization
An advantage of in-database process mining is that it enables query optimization. Query optimization, using the rewrite rules that are defined in Section 3.2, can greatly reduce the cost of executing a query. As an example, we show the cost of executing the query df_{c,t}(σ_{a θ v}(R)) and the equivalent query σ_{a.a θ v ∧ b.a θ v}(df_{c,t}(R)). These costs decrease at least linearly with the fraction of events that match the selection criterion. Let s be that fraction. Table 5 shows the different execution situations that can arise. It is possible to either first derive the directly follows relation and then do the selection, or vice versa. It is also possible that the intermediate results fit in memory, or that they must be stored on disk. If the results fit in memory (and the table is indexed with respect to the attribute on which the selection is done), then the execution costs are simply the cost of reading the log, or the part that matches the selection criterion, into memory once. If the intermediate results do not fit into memory, the order of the execution cost is as explained in the previous section. Remembering that the selection reduces the number of events from n to s·n leads to the formulas that are shown in the table.
| execution sequence | in memory (blocks) | on disk (blocks) |
|---|---|---|
| selection first | O(s·n/b) | O(s³·n³/(c²·b)) |
| selection last | O(n/b) | O(n³/(c²·b)) |
The most dramatic increase occurs when the intermediate results fit into memory if the selection is done first, but do not fit into memory if the selection is done last. In practice this is likely to be the case, because the selection can greatly reduce the number of events that are considered. In that situation, the order of the cost increases from O(s·n/b), plus one disk block to read the index that is used to optimize the selection, to O(n³/(c²·b)) according to the formulas from Table 5.
This shows that the way in which a query that includes the directly follows operator is executed greatly influences the execution cost. Query optimizers, which are parameterized with equivalence relations from section 3.2 and the cost calculation functions from section 4.1, can automatically determine the optimal execution strategy for a particular query.
5 Related Work
By defining an operator for efficiently extracting the ‘directly follows’ relation between events from a database, this paper has its basis in a long history of papers that focus on optimizing database operations. In particular, it is related to papers that focus on optimizing database operations for data mining purposes [11, 7], of which SAP HANA is a recent development. The idea of proposing domain-specific database operators has also been applied in other domains, such as spatio-temporal databases and scientific databases.
By presenting a ‘directly follows’ operator, the primary goal of this paper is to support computationally efficient process mining on a database. There exist some papers that deal with the computational complexity of process mining algorithms themselves [19, 8]. Also, in a research agenda for process mining, the computational complexity and memory usage of process mining algorithms have been identified as important topics. However, this paper focuses on a step that precedes the process mining itself: flexibly querying a database to investigate which information is useful for process mining.
6 Conclusions
This paper presents a first step towards in-database process mining. In particular, it completely defines a relational algebraic operator to extract the ‘directly follows’ relation from a log that is stored in a relational database, possibly distributed over multiple tables. The paper presents and proves relational algebraic properties of this operator, in particular that the operator commutes with the selection, projection, and theta join operators. These equivalences can be used for query optimization. Finally, the paper presents and proves formulas to estimate the computational cost of the operator. These formulas can be used in combination with the equivalences to determine the most efficient execution strategy for a query. By presenting and proving these properties, the paper provides the complete formal basis for implementing the operator in a specialized DBMS for process mining, which can be used to efficiently and conveniently query a database for process mining information.
Consequently, the obvious next step of this research is to implement the operator into a DBMS. The DBMS and the relational algebraic operators can then be further extended with additional process mining-specific operators, such as an operator to query for execution traces. In addition, more algebraic properties of those operators can be proven to assist with query optimization.
There are some limitations to the equivalence relations that are presented in this paper, in particular with respect to the conditions under which they hold. These limitations restrict the possibilities for query optimization. The extent to which these theoretical limitations impact practical performance of the operator must be investigated and, if possible, mitigated.
-  van der Aalst, W., et al.: Process mining manifesto. In: Proc. of BPM Workshops. pp. 169–194 (2012)
-  van der Aalst, W.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer (2011)
-  van der Aalst, W., Weijters, A.: Process mining: a research agenda. Computers in Industry 53(3), 231–244 (2004)
-  van der Aalst, W., Weijters, A., Maruster, L.: Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering 16(9), 1128–1142 (2004)
-  Abiteboul, S., Hull, R., Vianu, V.: Foundations of databases: the logical level. Addison-Wesley (1995)
-  Abraham, T., Roddick, J.F.: Survey of spatio-temporal databases. GeoInformatica 3(1), 61–99 (1999)
-  Agrawal, R., Imielinski, T., Swami, A.: Database mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering 5(6), 914–925 (1993)
-  Bergenthum, R., Desel, J., Lorenz, R., Mauser, S.: Process Mining Based on Regions of Languages, pp. 375–383 (2007)
-  Blasgen, M.W., Eswaran, K.P.: Storage and access in relational data bases. IBM Systems Journal 16(4), 363–377 (1977)
-  Chaudhuri, S.: An overview of query optimization in relational systems. In: Proc. of PODS. pp. 34–43. New York, NY, USA (1998)
-  Chen, M.S., Han, J., Yu, P.S.: Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering 8(6), 866–883 (1996)
-  Cudre-Mauroux, P., Kimura, H., Lim, K.T., Rogers, J., Simakov, R., Soroush, E., Velikhov, P., Wang, D.L., Balazinska, M., Becla, J., DeWitt, D., Heath, B., Maier, D., Madden, S., Patel, J., Stonebraker, M., Zdonik, S.: A demonstration of scidb: A science-oriented dbms. Proc. VLDB Endow. 2(2), 1534–1537 (2009)
-  van Dongen, B.: Real-life event logs - hospital log (2011)
-  Eder, J., Olivotto, G.E., Gruber, W.: A Data Warehouse for Workflow Logs, pp. 1–15 (2002)
-  Färber, F., Cha, S.K., Primsch, J., Bornhövd, C., Sigg, S., Lehner, W.: Sap hana database: Data management for modern business applications. SIGMOD Rec. 40(4), 45–51 (2012)
-  Günther, C., van der Aalst, W.: Fuzzy mining: Adaptive process simplification based on multi-perspective metrics. In: Proc. of BPM 2007. pp. 328–343 (2007)
-  Ingvaldsen, J.E., Gulla, J.A.: Preprocessing support for large scale process mining of sap transactions. In: Proc. of BPM. pp. 30–41 (2007)
-  Kim, W.: On optimizing an SQL-like nested query. ACM Trans. on Database Systems 7(3), 443–469 (1982)
-  Maggi, F.M., Bose, R.P.J.C., van der Aalst, W.M.P.: Efficient Discovery of Understandable Declarative Process Models from Event Logs, pp. 270–285 (2012)
-  zur Mühlen, M.: Process-driven management information systems – combining data warehouses and workflow technology. In: Proc. of ICECR. pp. 550–566 (2001)
-  Sagiv, Y., Yannakakis, M.: Equivalence among relational expressions with the union and difference operation. In: Proc. of VLDB. pp. 535–548 (1978)
-  Weijters, A., van der Aalst, W., Alves De Medeiros, A.: Process mining with the heuristics miner-algorithm. Technische Universiteit Eindhoven, Tech. Rep. WP 166, 1–34 (2006)