# Foundations of Complex Event Processing

Complex Event Processing (CEP) has emerged as the unifying field for technologies that require processing and correlating heterogeneous distributed data sources in real-time. CEP finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. However, existing CEP frameworks are based on ad-hoc solutions that do not rely on solid theoretical ground, making them hard to understand, extend or generalize. Moreover, they are usually presented as application programming interfaces documented by examples, and using each of them requires learning a different set of skills. In this paper we embark on the task of giving a rigorous framework to CEP. As a starting point, we propose a formal language for specifying complex events, called CEPL, that contains the common features used in the literature and has a simple and denotational semantics. We also formalize the so-called selection strategies, which are the cornerstone of CEP and had only been presented as by-design extensions to existing frameworks. With a well-defined semantics at hand, we study how to efficiently evaluate CEPL for processing complex events. We provide optimization results based on rewriting formulas to a normal form that simplifies the evaluation of filters. Furthermore, we introduce a formal computational model for CEP based on transducers and symbolic automata, called match automata, that captures the regular core of CEPL, i.e. formulas with unary predicates. By using rewriting techniques and automata-based translations, we show that formulas in the regular core of CEPL can be evaluated using constant time per event followed by constant-delay enumeration of the output (under data complexity). By gathering these results together, we propose a framework for efficiently evaluating CEPL, establishing solid foundations for future CEP systems.

## 1 Introduction

Complex Event Processing (CEP) has emerged as the unifying field of technologies for detecting situations of interest under high-throughput data streams. In scenarios like Network Intrusion Detection [43], Industrial Control Systems [33] or Real-Time Analytics [46], CEP systems aim to efficiently process arriving data, giving timely insights for implementing reactive responses to complex events.

Prominent examples of CEP systems from academia and industry include SASE [53], EsperTech [2], Cayuga [30], TESLA/T-Rex [26, 27], among others (see [28] for a survey). The main focus of these systems has been on practical issues like scalability, fault tolerance, and distribution, with the objective of making CEP systems applicable to real-life scenarios. Other design decisions, like query languages, are generally adapted to match computational models that can efficiently process data (see for example [54]). This has produced new data management and optimization techniques, generating promising results in the area [53, 2].

Unfortunately, as has been claimed several times [31, 55, 26, 15], CEP query languages lack a simple denotational semantics, which makes them difficult to understand, extend, or generalize. The semantics of several languages are defined either by examples [40, 8, 25] or by intermediate computational models [53, 48, 44]. Although there are frameworks that introduce formal semantics (e.g. [30, 19, 11, 26, 12]), they fall short of providing foundations for CEP languages. For instance, some of them are too complicated (e.g. sequencing is combined with filters), have unintuitive behavior (e.g. sequencing is non-associative), or are severely restricted (e.g. nesting operators is not supported). One symptom of this problem is that iteration, which is a fundamental operator in CEP, has not yet been successfully defined as a compositional operator. Since iteration is difficult to define and evaluate, it is usually restricted by disallowing nesting or the reuse of variables [53, 30]. Thus, without a formal and natural semantics, languages for CEP are in general cumbersome.

The lack of a simple denotational semantics also makes query languages difficult to evaluate. A common pattern in CEP systems is the use of sophisticated heuristics [54, 26] that cannot be replicated in other frameworks. Further, optimization techniques are usually proposed at the architecture level [41, 30, 44], preventing a unifying optimization theory. In this direction, many CEP frameworks use automata-based models [30, 19, 11] for query evaluation. However, these models are usually complicated [44, 48], informally defined [30], or non-standard [26, 9]. In practice this implies that, although finite state automata are a recurring approach in CEP, there is no general evaluation strategy with clear performance guarantees.

Given this scenario, the goal of this paper is to give solid foundations to CEP systems in terms of query language and query evaluation. Towards these goals, we first provide a formal language that allows for expressing the most common features of CEP systems, namely sequencing, filtering, disjunction, and iteration. We introduce complex event logic (CEL for short), a logic with well-defined compositional and denotational semantics. We also formalize the so-called selection strategies, an important notion of CEP that is usually discussed directly [54, 30] or indirectly [19] in the literature but has not been formalized at the language level.

Then, we embark on the design of a formal framework for CEL evaluation. This framework must consider three main building blocks for the efficient evaluation of CEL: (1) syntactic techniques for rewriting CEL queries, (2) a well-defined intermediate evaluation model, and (3) efficient translations and algorithms to evaluate this model. Regarding the rewriting techniques, we study the structure of CEL by introducing the notions of well-formed and safe formulas, and show that these restrictions are relevant for query evaluation. Further, we give a general result on rewriting CEL formulas into the so-called LP-normal form, a normal form for dealing with unary filters. For the intermediate evaluation model, we introduce a formal computational model for the regular fragment of CEL, called complex event automata (CEA). We show that this model is closed under I/O-determinization and provide translations from any CEL formula into CEA. More importantly, we show an efficient algorithm for evaluating CEA with clear performance guarantees: constant time per tuple followed by constant-delay enumeration of the output. We bring together our results to present a formal framework for evaluating CEL. Towards the end of the paper, we show an experimental evaluation of our framework against the leading CEP systems in the area. Our experiments show that our framework outperforms previous systems by orders of magnitude in terms of processing time and memory consumption.

Related work. Active Database Systems (ADSMS) and Data Stream Management Systems (DSMS) are solutions for processing data streams and they are usually associated with CEP systems. Both technologies, and especially DSMS, are designed for executing relational queries over dynamic data [23, 6, 13]. In contrast, CEP systems see data streams as sequences of data events where the arrival order is the main guide for finding patterns inside streams (see [28] for a comparison between ADSMS, DSMS, and CEP). In particular, DSMS query languages (e.g. CQL [14]) are incomparable with our framework since they do not focus on CEP operators like sequencing and iteration.

Query languages for CEP are usually divided into three approaches [28, 15]: logic-based, tree-based, and automata-based models. Logic-based models have their roots in temporal logic or the event calculus, and usually have a formal, declarative semantics [12, 16, 24] (see [17] for a survey). However, these approaches either do not include iteration as an operator or do not model the output explicitly. Furthermore, their evaluation techniques rely on logic inference mechanisms that are radically different from our approach. Tree-based models [42, 39, 2] have also been used for CEP, but their language semantics is usually non-declarative and their evaluation techniques are based on cost models, similar to relational database systems.

Automata-based models are the closest approach to the techniques used in this paper. Most proposals (e.g. SASE [9], NextCEP [48], DistCED [44]) do not rely on a denotational semantics; their output is defined by intermediate automata models. This implies that either iteration cannot be nested [9] or its semantics is confusing [48]. Other proposals (e.g. CEDR [19], TESLA [26], PBCED [11]) are defined with a formal semantics but do not include iteration. An exception is Cayuga [29], but its language does not allow the reuse of variables and its sequencing operator is non-associative, which results in a cumbersome semantics. Our framework is comparable to these systems, but provides a well-defined language that is compositional, allowing arbitrary nesting of operators. Moreover, we present the first evaluation of CEP queries that guarantees constant time per event and constant-delay enumeration of the output. We show experimentally that this vastly improves performance.

Finally, there has been some research on theoretical aspects of CEP, for instance on the axiomatization of temporal models [52], privacy [36], and load shedding [35]. This literature does not study the semantics and evaluation of CEP and is therefore orthogonal to our work.

Organization. We give an intuitive introduction to CEP and our framework in Section 2. In Sections 3 and 4 we formally present our logic and selection strategies. The syntactic structure of the logic is studied in Section 5. The computational model is studied in Section 6, where we also show how to compile formulas into automata. Section 7 presents our algorithms for efficient evaluation of automata. Section 8 puts all the results in perspective and shows our experimental evaluation of the framework. Future work is finally discussed in Section 9. Due to space limitations, all proofs are deferred to the appendix.

## 2 Events in Action

We start by presenting the main features and challenges of CEP. The examples used in this section will also serve throughout the paper as running examples.

In a CEP setting, events arrive in a streaming fashion to a system that must detect certain patterns [28]. For the purpose of illustration, assume there is a stream produced by wireless sensors positioned in a farm, whose main objective is to detect fires. As a first scenario, assume that there are three sensors, and each of them can measure both temperature (in Celsius degrees) and relative humidity (as the percentage of vapor in the air). Each sensor is assigned an id in {0, 1, 2}. The events produced by the sensors consist of the id of the sensor and a measurement of temperature or humidity. For brevity, we write T(i) for an event reporting a temperature from the sensor with id i, and similarly H(i) for events reporting humidity. Figure 1 depicts such a stream: each column is an event, and the value row is the temperature or humidity if the event is of type T or H, respectively.

The patterns to be detected are generally specified by domain experts. For the sake of illustration, assume that the position of sensor 0 is particularly prone to fires, and it has been detected that a temperature measurement above 40 degrees Celsius followed by a humidity measurement of less than 25% represents a fire with high probability. Let us intuitively explain how a domain expert can express this as a pattern (also called a formula) in our framework:

 φ1 = (T AS x; H AS y) FILTER (x.tmp > 40 ∧ y.hum <= 25 ∧ x.id = 0 ∧ y.id = 0)

This formula asks for two events, one of type temperature (T) and one of type humidity (H). The temperature and humidity events are given the names x and y, respectively, and the two events are filtered to select only those pairs representing a high temperature followed by a low humidity measured by sensor 0.

What should be the result of evaluating φ1 over the stream in Figure 1? A first important remark is that event streams are noisy in practice, and one does not expect the events matching a formula to be contiguous in the stream. A CEP engine thus needs to be able to dismiss irrelevant events. The semantics of the sequencing operator (;) will therefore allow arbitrary events to occur in between the events of interest. A second remark is that in CEP the set of events matching a pattern, called a complex event, is particularly relevant to the end user. Every time a formula matches a portion of the stream, the user should retrieve the events that compose that portion of the stream. This means that the evaluation of a formula over a stream should output a set of complex events. In our framework, each complex event will be the set of indexes (stream positions) of the events that witness the matching of a formula. Specifically, let S[i] be the event at position i of the stream S. What we expect as the output of formula φ1 is a set of pairs (i, j) such that S[i] is of type T, S[j] is of type H, i < j, and the two events satisfy the conditions expressed after the FILTER. The pairs satisfying these conditions can be read off Figure 1.
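To make this concrete, the following is a minimal sketch (our own naive enumeration, not the evaluation algorithm developed later in the paper; the dictionary encoding of events is an assumption) that computes all pairs of positions matching a φ1-style pattern over a small stream, skipping arbitrary events in between:

```python
def match_seq(stream):
    """Return all pairs (i, j) with i < j such that stream[i] is a
    T-event from sensor 0 with tmp > 40 and stream[j] is an H-event
    from sensor 0 with hum <= 25 (the filter of φ1)."""
    out = []
    for i, e in enumerate(stream):
        if e["type"] == "T" and e["id"] == 0 and e["tmp"] > 40:
            for j in range(i + 1, len(stream)):
                f = stream[j]
                if f["type"] == "H" and f["id"] == 0 and f["hum"] <= 25:
                    out.append((i, j))  # one complex event {i, j}
    return out

stream = [
    {"type": "T", "id": 0, "tmp": 45},
    {"type": "T", "id": 1, "tmp": 20},   # irrelevant event, skipped
    {"type": "H", "id": 0, "hum": 20},
]
print(match_seq(stream))  # → [(0, 2)]
```

Note that the nested loop makes the quadratic blow-up of candidate outputs explicit; the point of the paper's framework is precisely to avoid such naive enumeration.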

Formula φ1 illustrates in a simple way the two most elemental features of CEP, namely sequencing and filtering [28, 13, 54, 6, 21]. Although it detects a set of possible fires, it restricts the order in which the two events must occur, namely the temperature must be measured before the humidity. Naturally, this could prevent the detection of a fire in which the humidity was measured first. This motivates the introduction of disjunction, another common feature of CEP engines [28, 13]. To illustrate, we extend φ1 by allowing the events to appear in arbitrary order.

 φ2 = [(T AS x; H AS y) OR (H AS y; T AS x)] FILTER (x.tmp > 40 ∧ y.hum <= 25 ∧ x.id = 0 ∧ y.id = 0)

The OR operator allows any of the two patterns to be matched, and the filter is applied as in φ1. The result of evaluating φ2 over the stream of Figure 1 is the same as that of φ1, plus one additional complex event in which the humidity measurement precedes the temperature measurement.

The previous formulas show how CEP systems raise alerts when a certain complex event occurs. However, from a wider scope the objective of CEP is to retrieve information of interest from streams. For example, assume that we want to see how the temperature changes at the location of sensor 1 when there is an increase in humidity. A problem here is that we do not know a priori the number of temperature measurements; we need to capture an unbounded amount of events. The iteration operator + [28, 13] (also known as Kleene closure [34]) is introduced in most CEP frameworks to solve this problem. This operator introduces many difficulties in the semantics of CEP languages. For example, since events are not required to occur contiguously, the nesting of + is particularly tricky and most frameworks simply disallow it (see [53, 14, 30]). Coming back to our example, the formula for measuring temperatures whenever an increase in humidity is detected by sensor 1 is:

 φ3 = [H AS x; (T AS y FILTER y.id = 1)+; H AS z] FILTER (x.hum < 30 ∧ z.hum > 60 ∧ x.id = z.id = 1)

Intuitively, variables x and z witness the increase in humidity from less than 30% to more than 60%, and y captures the temperature measurements between x and z. Note that the filter for y is included inside the + operator. Some frameworks allow declaring variables inside a + and filtering them outside that operator (e.g. [53]). Although it is possible to define the semantics for that syntax, this form of filtering makes the definition of nesting difficult. Another semantic subtlety of the + operator is the association of y to an event. Given that we want to match y an unbounded number of times, how should the events associated to y occur in the complex events generated as output? Associating different events to the same variable during evaluation has proven to make the semantics of CEP languages cumbersome. In Section 3, we introduce a natural semantics that allows nesting and associates variables (inside + operators) to different events across repetitions.

Let us now explain the semantics of φ3 over the stream of Figure 1. Only two humidity events satisfy the top-most filter, and there are two temperature events from sensor 1 between them. As expected, the complex event containing all four of these positions is part of the output. However, there are also other complex events in the output. Since, as discussed, there might be irrelevant events between relevant ones, the semantics of + must allow for skipping arbitrary events. This implies that the two complex events that each skip one of the temperature events are also part of the output.
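The skipping behavior of + can be sketched as follows (our own illustration with hypothetical positions: x and z are the two humidity positions and the list holds the candidate temperature positions between them). Every non-empty subset of the temperature positions yields one complex event:

```python
from itertools import combinations

def iteration_outputs(x, temps, z):
    """All complex events {x} ∪ S ∪ {z} for non-empty subsets S of the
    candidate temperature positions: '+' matches at least one event but
    may skip any of the others."""
    out = []
    for r in range(1, len(temps) + 1):
        for s in combinations(temps, r):
            out.append(frozenset({x, z} | set(s)))
    return out

# Hypothetical positions: humidity at 2 and 7, temperatures at 3 and 5.
print(sorted(map(sorted, iteration_outputs(2, [3, 5], 7))))
# → [[2, 3, 5, 7], [2, 3, 7], [2, 5, 7]]
```

This makes explicit why the number of complex events can grow exponentially with the number of candidate events, which motivates the selection strategies of Section 4.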

The previous discussion raises an interesting question: are users interested in receiving all complex events? Are some complex events more informative than others? Coming back to the output of φ3, one can easily argue that the largest complex event is more informative than the others, since all events are contained in it. The complex events output by φ2 deserve a more careful analysis. In this scenario, the pairs that have the same second component represent a fire occurring at the same place and time, so one could argue that only one of the two is necessary. For cases like the above, it is common to find CEP systems that restrict the output by using so-called selection strategies (see for example [53, 54, 26]). Selection strategies are a fundamental feature of CEP. Unfortunately, they have only been presented as heuristics applied to particular computational models, and thus their semantics is given by an algorithm and is hard to understand. The next selection strategy (called skip-till-next-match in [53, 54]), which models the idea of outputting only those complex events that can be generated without skipping relevant events, deserves special mention. Although the semantics of next has been mentioned in previous papers (e.g. [19]), it is usually underspecified [53, 54] or complicates the semantics of other operators [30]. In Section 4, we formally define a set of selection strategies including next.

Before formally presenting our framework, we illustrate one more common feature of CEP, namely correlation. Correlation is introduced by filtering events with predicates that involve more than one event. For example, suppose that we want to see how the temperature changes at some location whenever there is an increase in humidity, as in φ3. What we need is a pattern where all the events are produced by the same sensor, but that sensor is not necessarily sensor 1. This is achieved by the following pattern:

 φ4 = [H AS x; (T AS y FILTER y.id = x.id)+; H AS z] FILTER (x.hum < 30 ∧ z.hum > 60 ∧ x.id = z.id)

Notice that here the filters contain the binary predicates y.id = x.id and x.id = z.id, which force all events to have the same id. Although this might seem simple, the evaluation of formulas that correlate events introduces new challenges. Intuitively, formula φ4 is more complicated because the value of x.id must be remembered and used during evaluation in order to compare it with future incoming events. If the reader is familiar with automata theory [37, 47], this behavior is clearly not “regular” and will not be captured by a finite state model. In this paper, we study and characterize the regular part of CEP systems. Therefore, from Section 6 to Section 8 we focus on formulas without correlation. As we will see, the formal analysis of this fragment already presents important challenges, which is why we defer the analysis of formulas like φ4 to future work. It is important to mention that the semantics of our language (including selection strategies) is general and includes correlation.
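To see why correlation requires memory, consider this sketch (our own illustration; the dictionary encoding of events is an assumption): checking the binary predicate y.id = x.id forces the evaluator to carry x's id, a value from an unbounded domain, across the stream, which is precisely what a finite state model cannot do:

```python
def correlated_temps(stream):
    """For each humidity event x, return the later temperature positions
    whose id equals x.id (the data value carried across the stream)."""
    out = []
    for i, e in enumerate(stream):
        if e["type"] == "H":
            matches = [j for j in range(i + 1, len(stream))
                       if stream[j]["type"] == "T"
                       and stream[j]["id"] == e["id"]]  # binary predicate y.id = x.id
            if matches:
                out.append((i, matches))
    return out

s = [{"type": "H", "id": 1, "hum": 25},
     {"type": "T", "id": 1, "tmp": 30},
     {"type": "T", "id": 2, "tmp": 31},   # different id: filtered out
     {"type": "H", "id": 1, "hum": 70}]
print(correlated_temps(s))  # → [(0, [1])]
```

A unary filter like y.id = 1, in contrast, can be checked by looking at each event in isolation, which is what keeps the fragment of Sections 6 to 8 "regular".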

## 3 A query language for CEP

Having discussed and illustrated the common operators and features of CEP, we proceed to formally introduce CEL (Complex Event Logic), our pattern language for capturing complex events.

Schemas, Tuples and Streams. Let A be a set of attribute names and D be a set of values. A database schema R is a finite set of relation names, where each relation name R ∈ R is associated to a tuple of attributes in A, denoted by att(R). If R is a relation name, then an R-tuple is a function t : att(R) → D. We say that the type of an R-tuple t is R, and denote this by type(t) = R. For any relation name R, tuples(R) denotes the set of all possible R-tuples. Similarly, for any database schema R, tuples(R) denotes the union of tuples(R) over all R ∈ R.

Given a schema R, an R-stream is an infinite sequence S = t0 t1 t2 ⋯ where each ti ∈ tuples(R). When R is clear from the context, we refer to S simply as a stream. Given a stream S and a position i ∈ ℕ, the i-th element of S is denoted by S[i], and the sub-stream ti ti+1 ⋯ of S is denoted by Si. Note that in this paper we consider that the time of each event is given by its index, and defer a more elaborate time model (like [52]) for future work.

Let X be a set of variables. Given a schema R, a predicate of arity n is an n-ary relation over tuples(R), i.e. P ⊆ tuples(R)ⁿ. An atom is an expression P(x1, …, xn) (or P(x̄)) where P is an n-ary predicate and x1, …, xn ∈ X. For example, x.hum < 30 is an atom, and hum < 30 is the predicate of all tuples that have a humidity attribute with value less than 30. In this paper, we consider a fixed set of predicates, denoted by P. Moreover, we assume that P is closed under intersection, union, and complement, and contains a predicate for checking whether a tuple is an R-tuple, for every relation name R.

CEL syntax. Now we proceed to give the syntax of what we call the core of CEL (core-CEL for short), a logic inspired by the operations described in the previous section. This language features the most essential CEP features. The set of formulas in core-CEL, or core formulas for short, is given by the following grammar:

 φ := R AS x ∣ φ FILTER P(x̄) ∣ φ OR φ ∣ φ; φ ∣ φ+

Here R is a relation name, x is a variable in X, and P(x̄) is an atom in P. All formulas in Section 2 are CEL formulas. Furthermore, formulas of the form φ FILTER (P1(x̄1) ∧ P2(x̄2)) or φ FILTER (P1(x̄1) ∨ P2(x̄2)) are used as syntactic sugar for (φ FILTER P1(x̄1)) FILTER P2(x̄2) or (φ FILTER P1(x̄1)) OR (φ FILTER P2(x̄2)), respectively. As opposed to existing frameworks, we do not restrict the use of operators or variables, allowing arbitrary nesting (in particular of +).

CEL semantics. We proceed to define the semantics of core formulas, for which we need some further notation. A complex event C is defined as a non-empty and finite set of indices. As mentioned in Section 2, a complex event contains the positions of the events that witness the matching of a formula over a stream; moreover, complex events are the final output of evaluating a formula over a stream. We denote by |C| the size of C, and by min(C) and max(C) the minimum and maximum elements of C, respectively. Given two complex events C1 and C2, C1 · C2 denotes their concatenation, that is, C1 · C2 = C1 ∪ C2 whenever max(C1) < min(C2), and is undefined otherwise.
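Concatenation of complex events, C1 · C2 = C1 ∪ C2 when max(C1) < min(C2) and undefined otherwise, can be transcribed directly (a sketch, representing complex events as Python sets of positions):

```python
def concat(c1, c2):
    """C1 · C2 = C1 ∪ C2 when max(C1) < min(C2); undefined otherwise.
    We model the undefined case as None."""
    if max(c1) < min(c2):
        return c1 | c2
    return None  # undefined: the two complex events overlap in time

assert concat({1, 3}, {5, 8}) == {1, 3, 5, 8}
assert concat({1, 6}, {5, 8}) is None  # 6 > 5, so C1 · C2 is undefined
```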

In core-CEL formulas, variables are second-class citizens: they are only used to filter and select particular events, i.e. they are not retrieved as part of the output. As the examples in Section 2 suggest, we are only concerned with finding the events that compose the complex events, and not with which position corresponds to which variable. The reason behind this is that the + operator allows for repetitions, and therefore variables under a (possibly nested) + operator would need to have a special meaning, particularly for filtering. This discussion motivates the following definitions. Given a formula φ, we denote by var(φ) the set of all variables mentioned in φ (including its predicates), and by vdef(φ) all variables defined in φ by a clause of the form R AS x. Furthermore, vdef⁺(φ) denotes all variables in vdef(φ) that are defined outside the scope of every + operator. For example, for φ3 we have var(φ3) = {x, y, z}, vdef(φ3) = {x, y, z}, and vdef⁺(φ3) = {x, z}. Finally, a valuation is a function ν : X → ℕ. Given a finite set of variables U ⊆ X and two valuations ν1 and ν2, the valuation ν1[ν2/U] is defined by ν1[ν2/U](x) = ν2(x) if x ∈ U, and ν1[ν2/U](x) = ν1(x) otherwise.

We are ready to define the semantics of a core-CEL formula φ. Given a complex event C and a stream S, we say that C is in the evaluation of φ over S under valuation ν (denoted C ∈ ⟦φ⟧(S, ν)) if one of the following conditions holds:

• φ = R AS x, C = {ν(x)}, and type(S[ν(x)]) = R.

• φ = ρ FILTER P(x1, …, xn) and both C ∈ ⟦ρ⟧(S, ν) and (S[ν(x1)], …, S[ν(xn)]) ∈ P hold.

• φ = ρ1 OR ρ2 and C ∈ ⟦ρ1⟧(S, ν) or C ∈ ⟦ρ2⟧(S, ν).

• φ = ρ1; ρ2 and there exist complex events C1 and C2 such that C = C1 · C2, C1 ∈ ⟦ρ1⟧(S, ν), and C2 ∈ ⟦ρ2⟧(S, ν).

• φ = ρ+ and there exists a valuation ν′ such that C ∈ ⟦ρ⟧(S, ν′) or C ∈ ⟦ρ; ρ+⟧(S, ν′), where ν′ agrees with ν on all variables defined outside the scope of every + operator.

There are a couple of important remarks here. First, the valuation ν can be defined over a superset of the variables mentioned in the formula. This is important for sequencing (;) because we require the complex events from both sides to be produced with the same valuation. Second, when we evaluate a subformula of the form ρ+, we carry over the values of variables defined outside the subformula. For example, the subformula (T AS y FILTER y.id = x.id)+ of φ4 does not define the variable x. However, from the definition of the semantics we see that x will already be assigned (because x occurs outside the subformula). This is precisely where other frameworks fail to formalize iteration: without this construct it is not easy to correlate the variables inside + with the ones outside, as we illustrated with φ4.

As previously discussed, in core-CEL variables are just used for comparing attributes of events, and are not relevant for the final output. In consequence, we say that C belongs to the evaluation of φ over S (denoted C ∈ ⟦φ⟧(S)) if there is a valuation ν such that C ∈ ⟦φ⟧(S, ν). As an example, the complex events presented in Section 2 are indeed the outputs of φ1 to φ4 over the stream in Figure 1.

## 4 Selection strategies

Matching complex events is a computationally intensive task. As the examples in Section 2 might suggest, the main reason behind this is that the number of complex events can grow exponentially in the size of the stream, forcing systems to process large numbers of candidate outputs. In order to speed up the matching process, it is common to restrict the set of results [22, 53, 54]. As we validate in the experimental section, this is required for current CEP systems to work in practice. Unfortunately, most proposals in the literature restrict outputs by introducing heuristics into particular computational models without describing how the semantics are affected. For a more general approach, we introduce selection strategies (or selectors) as unary operators over core-CEL formulas. Formally, we define four selection strategies, called strict (STRICT), next (NXT), last (LAST) and max (MAX). STRICT and NXT are motivated by previously introduced operators [53] under the names strict-contiguity and skip-till-next-match, respectively. LAST and MAX are introduced here as useful selection strategies from a semantic point of view. We proceed to define each selection strategy below, giving its motivation and formal semantics.

STRICT. As the name suggests, STRICT (or strict-contiguity) keeps only the complex events that are contiguous in the stream, essentially reducing the evaluation problem to that of regular expressions. To motivate this, recall that formula φ1 in Section 2 detects complex events composed of a temperature above 40 degrees Celsius followed by a humidity of less than 25%. As already argued, in general one could expect other events between the two. However, it could be the case that this pattern is of interest only if the events occur contiguously in the stream, namely a humidity measurement immediately after a temperature measurement. For this purpose, STRICT reduces the set of outputs by selecting only strictly consecutive complex events. Formally, for any CEL formula φ we have that C ∈ ⟦STRICT(φ)⟧(S) holds if C ∈ ⟦φ⟧(S) and, for every i, j ∈ C, every k with i < k < j is also in C (i.e., C is an interval). In our running example, STRICT(φ1) would keep only the output of φ1 in which the two events are consecutive in the stream, discarding the others.
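The interval condition of STRICT is easy to transcribe (a sketch, representing a complex event as a Python set of positions):

```python
def is_interval(c):
    """A complex event passes STRICT only if its positions form a
    contiguous interval [min(C), max(C)]."""
    return c == set(range(min(c), max(c) + 1))

assert is_interval({3, 4, 5})       # contiguous: kept by STRICT
assert not is_interval({3, 5})      # gap at 4: discarded
```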

NXT. The second selector, NXT, is similar to the previously proposed operator skip-till-next-match [53]. The motivation behind this operator comes from a heuristic that consumes a stream while skipping events that cannot participate in the output, but matches patterns in a greedy manner, selecting only the first event satisfying the next element of the query. In [53] the definition is given informally as:

“a further relaxation is to remove the contiguity requirements: all irrelevant events will be skipped until the next relevant event is read” (*).

In practice, the definition of skip-till-next-match is given by a greedy evaluation algorithm that adds an event to the output whenever a sequencing operator is used, and goes as far as possible adding events whenever an iteration operator is used. The fact that the semantics is only defined by an algorithm forces users to understand the algorithm in order to write meaningful queries. In other words, this operator speeds up evaluation by sacrificing the clarity of the semantics.

To overcome this problem, we formalize the intuition behind (*) based on a special order over complex events. As we will see later, this allows speeding up the evaluation process as much as skip-till-next-match while providing clear and intuitive semantics. Let C1 and C2 be complex events. The symmetric difference between C1 and C2 (denoted C1 △ C2) is the set of all elements in either C1 or C2 but not in both. We say that C1 ≤ C2 if either C1 = C2 or min(C1 △ C2) ∈ C2; that is, the greater complex event is the one containing the earliest position at which the two differ. For example, {1, 3} ≤ {1, 2} since the minimum element of {1, 3} △ {1, 2} = {2, 3} is 2, which is in {1, 2}. Note that this is intuitively similar to skip-till-next-match, as we are preferring the first relevant event. An important property is that the ≤-relation forms a total order among complex events, implying the existence of a minimum and a maximum over any finite set of complex events.

###### Lemma 1

The relation ≤ is a total order on complex events.
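A minimal sketch of this order, representing complex events as sets of positions (the direction convention, that the greater complex event is the one containing the earliest differing position, is an assumption of this sketch), together with a check of totality and antisymmetry on a few samples:

```python
def leq_nxt(c1, c2):
    """C1 <= C2 iff C1 == C2 or the earliest position where they differ
    (min of the symmetric difference) lies in C2."""
    return c1 == c2 or min(c1 ^ c2) in c2

# Totality and antisymmetry on a small sample of complex events.
events = [frozenset(s) for s in [{1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]]
for a in events:
    for b in events:
        assert leq_nxt(a, b) or leq_nxt(b, a)                    # total
        assert not (leq_nxt(a, b) and leq_nxt(b, a)) or a == b   # antisymmetric
```

Totality follows because for distinct complex events the minimum of the symmetric difference belongs to exactly one of the two.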

We can now define the semantics of NXT: for a CEL formula φ, we have that C ∈ ⟦NXT(φ)⟧(S) if C ∈ ⟦φ⟧(S) and, for every complex event C′, if C′ ∈ ⟦φ⟧(S) and max(C′) = max(C), then C′ ≤ C. In our running example, when evaluating NXT(φ3) over the stream of Figure 1, among the complex events that end at the same position only the ≤-greatest one survives, namely the one that greedily selects the first relevant events. Note that we compare only outputs that have the same final position; this way, complex events are discarded only when there is a preferred complex event triggered by the same last event.

LAST. The NXT selector is motivated by the computational benefit of skipping irrelevant events in a greedy fashion. However, from a semantic point of view it might not be what a user wants. For example, considering φ2 and the stream of Section 2 again, NXT selects the oldest complex event for the formula. We argue here that a user might actually prefer the opposite, i.e. the most recent explanation for the matching of a formula. This is the idea captured by LAST. Formally, the LAST selector is defined exactly as NXT, but replacing the order ≤ by ≤last: if C1 and C2 are two complex events, then C1 ≤last C2 if either C1 = C2 or max(C1 △ C2) ∈ C2. For example, {1, 2} ≤last {1, 3} since the maximum element of {1, 2} △ {1, 3} = {2, 3} is 3, which is in {1, 3}. In our running example, LAST(φ2) would select the most recent temperature and humidity that explain the matching of φ2, which might be a better explanation for a possible fire. Surprisingly, we show in Section 7 that LAST enjoys the same good computational properties as NXT.
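A sketch of this dual order, again representing complex events as sets of positions (direction convention assumed: the greater complex event contains the latest position at which the two differ):

```python
def leq_last(c1, c2):
    """Like the nxt order, but comparing the LATEST differing position:
    C1 <= C2 iff C1 == C2 or max(C1 ^ C2) lies in C2."""
    return c1 == c2 or max(set(c1) ^ set(c2)) in c2

# {1, 3} is preferred over {1, 2}: it contains the more recent event 3.
assert leq_last({1, 2}, {1, 3})
assert not leq_last({1, 3}, {1, 2})
```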

MAX. A more ambitious selection strategy is to keep all the maximal complex events in terms of set inclusion. This corresponds to obtaining those complex events that are as informative as possible, which could naturally be more useful for end users. Formally, given a CEL formula φ, we say that C ∈ ⟦MAX(φ)⟧(S, i) holds iff C ∈ ⟦φ⟧(S, i) and for all C′ ∈ ⟦φ⟧(S, i), if C ⊆ C′ then C = C′. Coming back to our running example, the MAX selector outputs exactly those complex events that are maximal in terms of set inclusion, discarding the sub-maximal outputs of the original formula. It is interesting to note that evaluating NXT and LAST over the same stream can also yield a maximal complex event as the only output, illustrating that these strategies also tend to produce complex events with maximal information.
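The MAX strategy amounts to a maximality filter over the candidate outputs. A minimal sketch, again representing complex events as sets of stream positions:

```python
def max_strategy(outputs):
    """MAX selection strategy sketch: keep only the complex events
    (sets of positions) that are maximal under set inclusion."""
    # c < d tests proper subset for Python sets
    return [c for c in outputs if not any(c < d for d in outputs)]
```

Note that a practical engine would never materialize all candidate outputs like this; the sketch only serves as a reference semantics for MAX.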

We have formally presented the foundations of a language for recognizing complex events, and how to restrict the outputs of this language in meaningful ways. In the following, we study practical aspects of the CEL syntax that impact how efficiently formulas can be evaluated.

## 5 Syntactic analysis of CEL

We now turn to study the syntactic form of CEL formulas. We define well-formed and safe formulas, which are syntactic restrictions that characterize semantic properties of interest. Then, we define a convenient normal form and show that any formula can be rewritten in this form.

### 5.1 Syntactic restrictions of formulas

Although CEL has well-defined semantics, there are some formulas whose semantics can be unintuitive because the use of variables is not restricted. Consider for example

 φ5 = (H AS x) FILTER (y.tmp≤30).

Here, x will naturally be bound to the only element in a complex event, but y will not add a new position to the output. By the semantics of CEL, a valuation for φ5 must assign a position to y that satisfies the filter, but such a position is not restricted to occur in the complex event. Moreover, y is not necessarily bound to any of the events seen up to the last element, and thus a complex event could depend on future events. For example, if we evaluate φ5 over our running example (Figure 1), a complex event is produced whose filter is only satisfied by a later event. This means that to evaluate this formula we potentially need to inspect events that occur after all events composing the output complex event have been seen, an arguably undesired situation.

To avoid this problem, we introduce the notion of well-formed formulas. As the previous example illustrates, this requires defining where variables are bound by a sub-formula of the form R AS x. The set of bound variables of a formula ρ is denoted by bound(ρ) and is recursively defined as follows:

 bound(R AS x) = {x}
 bound(ρ FILTER P(¯x)) = bound(ρ)
 bound(ρ1 OR ρ2) = bound(ρ1) ∩ bound(ρ2)
 bound(ρ1 ; ρ2) = bound(ρ1) ∪ bound(ρ2)
 bound(ρ+) = ∅
 bound(SEL(ρ)) = bound(ρ)

where SEL is any selection strategy. Note that for the OR operator a variable must be defined in both formulas in order to be bound. We say that a CEL formula φ is well-formed if for every sub-formula of the form ρ FILTER P(¯x) and every x ∈ ¯x, there is another sub-formula ρ′ such that x ∈ bound(ρ′) and ρ FILTER P(¯x) is a sub-formula of ρ′. Note that this definition allows for including filters with variables defined in a wider scope. For example, a formula in Section 2 is well-formed although it has a not-well-formed formula as a sub-formula.
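The recursive definition of bound variables translates directly into code. A sketch follows, over a hypothetical AST encoding of CEL formulas as nested tuples (the encoding is ours, not the paper's):

```python
def bound(f):
    """Bound variables of a CEL formula, following the recursive
    definition above. Hypothetical AST encoding:
      ('as', R, x), ('filter', rho, pred), ('or', r1, r2),
      ('seq', r1, r2), ('plus', rho), ('sel', strategy, rho)."""
    op = f[0]
    if op == 'as':
        return {f[2]}
    if op == 'filter':
        return bound(f[1])          # filters bind nothing new
    if op == 'or':
        return bound(f[1]) & bound(f[2])   # must be bound on both sides
    if op == 'seq':
        return bound(f[1]) | bound(f[2])
    if op == 'plus':
        return set()                # iteration binds no variables
    if op == 'sel':
        return bound(f[2])
    raise ValueError(f"unknown operator: {op}")
```

For example, in a sequence whose second operand is a disjunction, a variable counts as bound only if both branches of the disjunction define it.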

One can argue that it would be desirable to restrict users to writing only well-formed formulas. Indeed, the well-formed property can be checked efficiently by a syntactic parser, and users should understand that all variables in a formula must be correctly defined. Given that well-formed formulas have a well-defined variable structure, in what follows we restrict our analysis to well-formed formulas.

Another issue for CEL is that the reuse of variables can easily produce unsatisfiable formulas. For example, a formula that sequences two sub-formulas binding the same variable is not satisfiable (i.e. it produces no complex event over any stream), because a variable cannot be assigned to two different positions in the stream. However, we do not want to be too conservative and disallow the reuse of variables in the whole formula (otherwise, formulas like the ones in Section 2 would not be permitted). This motivates the notion of safe CEL formulas. We say that a CEL formula is safe if for every sub-formula of the form ρ1 ; ρ2 it holds that bound(ρ1) ∩ bound(ρ2) = ∅. For example, all CEL formulas in this paper are safe except for the formula above.
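A safety check is a simple recursion over the formula. The sketch below assumes the disjoint-bound-variables condition for sequenced sub-formulas, and reuses the same hypothetical tuple encoding of formulas as above:

```python
def bound(f):
    """Bound variables over the hypothetical tuple encoding:
    ('as', R, x), ('filter', rho, pred), ('or', r1, r2),
    ('seq', r1, r2), ('plus', rho), ('sel', strategy, rho)."""
    op = f[0]
    if op == 'as':     return {f[2]}
    if op == 'filter': return bound(f[1])
    if op == 'or':     return bound(f[1]) & bound(f[2])
    if op == 'seq':    return bound(f[1]) | bound(f[2])
    if op == 'plus':   return set()
    if op == 'sel':    return bound(f[2])

def is_safe(f):
    """Check safety recursively: sequenced sub-formulas must not bind
    a common variable (assumed reading of the condition), while reuse
    across OR branches remains allowed."""
    op = f[0]
    if op == 'as':
        return True
    if op in ('filter', 'plus'):
        return is_safe(f[1])
    if op == 'sel':
        return is_safe(f[2])
    if op == 'seq':
        return (not (bound(f[1]) & bound(f[2]))
                and is_safe(f[1]) and is_safe(f[2]))
    if op == 'or':
        return is_safe(f[1]) and is_safe(f[2])
```

As in the text, the check runs in time linear in the formula and can be performed during parsing.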

The safety notion is a mild restriction that helps the evaluation of CEL, and it can be easily checked at parsing time. However, safe formulas are a syntactic subclass of CEL, and it could be the case that they do not capture the full language. We show in the next result that this is not the case. Formally, we say that two CEL formulas φ and φ′ are equivalent, denoted by φ ≡ φ′, if for every stream S and complex event C, it is the case that C ∈ ⟦φ⟧(S) if, and only if, C ∈ ⟦φ′⟧(S).

###### Theorem 1

Given a core-CEL formula φ, there is a safe formula φ′ s.t. φ ≡ φ′ and the size of φ′ is at most exponential in the size of φ.

By this result, we can restrict our analysis to safe formulas without loss of generality. Unfortunately, we do not know whether the exponential size of φ′ is necessary. We conjecture that this exponential blow-up is unavoidable; however, we have not yet established the corresponding lower bound.

### 5.2 LP-normal form

Now we study how to rewrite CEL formulas in order to simplify the evaluation of unary filters. Intuitively, filter operators in a CEL formula can become difficult to handle for a CEP query engine. To illustrate this, consider again the first formula of Section 2. Syntactically, this formula states “find a T event followed by an H event, and then check that they satisfy the filter conditions”. However, we would like an execution engine to only consider those T events that represent temperatures above 40 degrees; only afterwards should the possible matching H events be considered. In other words, the formula can be restated as:

 φ′1 = [(T AS x) FILTER (x.tmp > 40 ∧ x.id = 0)] ; [(H AS y) FILTER (y.hum ≤ 25 ∧ y.id = 0)]

This example motivates defining the locally parametrized normal form (LP-normal form). Let U be the set of all predicates of arity 1 (i.e. unary predicates). We say that a formula φ is in LP-normal form if the following condition holds: for every sub-formula of φ of the form ρ FILTER P(x), if P ∈ U, then ρ = R AS x for some event type R and variable x. In other words, all filters containing unary predicates are applied directly to the definitions of their variables. For instance, formula φ′1 is in LP-normal form while formulas like the original one above are not. Note that non-unary predicates are not restricted, and they can be used anywhere in the formula.
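The simplest case of this rewriting, pushing a unary filter down to the AS definition it refers to, can be sketched as follows. This mirrors only the easy sequence case of the example above; the general rewriting behind Theorem 2 (through OR, iteration, and nested filters) is more involved. The tuple AST encoding is hypothetical:

```python
def push_filter(f, var, pred):
    """Push a unary filter on `var` down to the (R AS var) sub-formula,
    a sketch of the simplest LP-normal-form rewriting step.
    Hypothetical encoding: ('as', R, x), ('seq', r1, r2),
    ('filter', rho, p). Other operators are left untouched here."""
    op = f[0]
    if op == 'as' and f[2] == var:
        # apply the filter directly at the variable's definition
        return ('filter', f, (var, pred))
    if op == 'seq':
        return ('seq', push_filter(f[1], var, pred),
                       push_filter(f[2], var, pred))
    return f
```

Running it on a sequence of two AS definitions attaches the filter to the matching definition and leaves the rest of the formula unchanged.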

One can easily see that having formulas in LP-normal form would be an advantage for an evaluation engine, because it can filter out some events as soon as they arrive (see Section 8 for further discussion). However, formulas that are not in LP-normal form can still be very useful for declaring patterns. To illustrate this, consider the formula:

Here, the formula works like a conditional statement: if the first temperature is greater than a given threshold, then the following event should be a temperature, and a humidity event otherwise. This type of conditional statement can be very useful, but at the same time it can be hard to evaluate. Fortunately, the next result shows that one can always rewrite a formula into LP-normal form, incurring in the worst case an exponential blow-up in the size of the formula.

###### Theorem 2

Let φ be a core-CEL formula. Then, there is a core-CEL formula φ′ in LP-normal form such that φ ≡ φ′, and the size of φ′ is at most exponential in the size of φ.

The importance of this result and Theorem 1 will become clear in the next sections, where we show that safe formulas in LP-normal form have good properties for evaluation. Similar to Theorem 1, we do not know if the exponential blow-up is unavoidable and leave this for future work.

## 6 A computational model for CEL

In this section, we introduce a formal computational model for evaluating CEL formulas called complex event automata (CEA for short). As in classical database management systems (DBMS), it is useful to have a formal model that stands between the query language and the evaluation algorithms, in order to simplify the analysis and optimization of the whole evaluation process. Well-known examples of this approach are finite state automata for regular expressions [37, 10] and relational algebra for SQL [7, 45]. Here, we propose CEA as the intermediate evaluation model for CEL and show later how to compile any (unary) CEL formula into a CEA.

As its name suggests, complex event automata (CEA) are an extension of finite state automata (FSA). The first difference from FSA comes from handling streams instead of words: a CEA runs over a stream of tuples, unlike an FSA, which runs over words of a certain alphabet. The second difference arises directly from the first one, namely the need to process tuples, which can have infinitely many different values, in contrast to the finite input alphabet of FSA. To handle this, our model is extended in the same way as symbolic finite automata (SFA) [51]. SFAs are finite state automata in which the alphabet is described implicitly by a boolean algebra over the symbols. This allows automata to work with a possibly infinite alphabet and, at the same time, use finite state memory for processing the input. CEA are extended analogously, which is reflected in transitions labeled by unary predicates over tuples. The last difference addresses the need to generate complex events instead of boolean answers. A well-known extension of FSA are finite state transducers [20], which are capable of producing an output whenever an input element is read. Our computational model follows the same approach: CEA can generate and output complex events while reading a stream.

Recall from Section 5 that U denotes the set of unary predicates. Let • and ◦ be two symbols (for marking and non-marking transitions). A complex event automaton (CEA) is a tuple A = (Q, Δ, I, F) where Q is a finite set of states, Δ ⊆ Q × U × {•, ◦} × Q is the transition relation, and I, F ⊆ Q are the sets of initial and final states, respectively. Given a stream S = t0 t1 …, a run ρ of A over S of length n is a sequence of transitions q0 →(P0/m0) q1 →(P1/m1) ⋯ →(Pn−1/mn−1) qn such that q0 ∈ I and, for every i < n, (qi, Pi, mi, qi+1) ∈ Δ and ti satisfies Pi. We say that ρ is accepting if qn ∈ F and mn−1 = •. We denote by Runn(A, S) the set of accepting runs of A over S of length n. Further, match(ρ) denotes the set of positions where the run marks the stream, namely match(ρ) = {i | mi = •}. Intuitively, this means that when a transition is taken, if the transition has the symbol • then the current position of the stream is included in the output (similar to the execution of a transducer). Note that we require the last position of an accepting run to be marking, as otherwise an output could depend on future events (see the discussion about well-formed formulas in Section 5). Given a stream S and n ≥ 0, we define the set of complex events of A over S at position n as ⟦A⟧(S, n) = {match(ρ) | ρ ∈ Runn(A, S)} and the set of all complex events as ⟦A⟧(S) = ⋃n ⟦A⟧(S, n). Note that ⟦A⟧(S) can be infinite, but ⟦A⟧(S, n) is finite.
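These semantics can be stated as a small brute-force reference implementation, useful for checking the behaviour of an automaton on short streams (exponential in general, so for illustration only; the encoding of the CEA as a dict is our assumption):

```python
def cea_outputs(cea, stream, n):
    """Brute-force reference semantics for a CEA (sketch). `cea` is a
    dict with keys 'delta' (a list of tuples (p, pred, mark, q), with
    pred a unary predicate over tuples and mark=True for a marking
    transition), 'initial' and 'final' (sets of states). Returns the
    complex events at position n: the marked-position sets of accepting
    runs of length n whose last transition marks the stream."""
    results = set()

    def go(state, pos, marked):
        if pos == n:
            # accepting: final state, and last position n-1 was marked
            if state in cea['final'] and (n - 1) in marked:
                results.add(frozenset(marked))
            return
        for (p, pred, mark, q) in cea['delta']:
            if p == state and pred(stream[pos]):
                go(q, pos + 1, marked | {pos} if mark else marked)

    for q0 in cea['initial']:
        go(q0, 0, frozenset())
    return results
```

For an automaton that skips tuples on a self-loop and marks one tuple with temperature above 40, the output at each position is the singleton containing the last such tuple's position.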

Consider as an example the CEA depicted in Figure 2. In this CEA, each transition marks one -tuple and each transition marks a sequence of -tuples with temperature bigger than . Note also that the transitions labeled by allow to arbitrarily skip tuples of the stream. Then, for every stream , represents the set of all complex events that begin and end with an -tuple and also contain some of the -tuples with temperature higher than .

It is important to stress that CEA are designed to be an evaluation model for the unary sub-fragment of CEL (a formal definition is presented in the next paragraph). Several computational models have been proposed for complex event processing [30, 44, 53, 48], but most of them are informal and non-standard extensions of finite state automata. In our framework, we want to take a step back from previous proposals and define a simple but powerful model that captures the regular core of CEL. By “regular” we mean all CEL formulas that can be evaluated with finite state memory. Intuitively, the unary formulas presented in Section 2 can be evaluated using a bounded amount of memory. In contrast, a formula with a binary predicate needs unbounded memory to store candidate events seen in the past, and thus it calls for a more sophisticated model (e.g. data automata [49]). Of course, one would like to have a full-fledged model for CEL, but to this end we must first understand the regular fragment. For these reasons, a computational model for the whole CEP logic is left as future work (see Section 9).

Compiling unary CEL into CEA. We now show how to compile a well-formed and unary CEL formula φ into an equivalent CEA A. Formally, we say that a CEL formula is unary if for every sub-formula of the form ρ FILTER P(x), it holds that P is a unary predicate (i.e. P ∈ U). For example, all but one of the formulas in Section 2 are unary; the remaining one contains a binary predicate. As motivated in Sections 2 and 5.2, and further supported by our experiments (see Section 8), despite their apparent simplicity unary formulas already present non-trivial computational challenges.

###### Theorem 3

For every well-formed formula φ in unary core-CEL, there is a CEA A equivalent to φ. Furthermore, A is of size at most linear in |φ| if φ is safe and in LP-normal form, and at most double exponential in |φ| otherwise.

The proof of Theorem 3 is closely related to the safety condition and the LP-normal form presented in Section 5. The construction goes by first converting φ into an equivalent CEL formula φ′ in LP-normal form (Theorem 2) and then building an equivalent CEA from φ′. We show that there is an exponential blow-up for converting φ into LP-normal form. Furthermore, we show that the output of the second step is of linear size if φ′ is safe, and of exponential size otherwise, suggesting that restricting the language to safe formulas allows for more efficient evaluation.

So far we have described the compilation process without considering selection strategies. To include them, we need to extend our notation and allow selection strategies to be applied directly over CEA. Given a CEA A, a selection strategy SEL, and a stream S, the set of outputs ⟦SEL(A)⟧(S) is defined analogously to ⟦SEL(φ)⟧(S) for a formula φ. Then, we say that a CEA B is equivalent to SEL(A) if ⟦B⟧(S) = ⟦SEL(A)⟧(S) for every stream S.

###### Theorem 4

Let SEL be a selection strategy. For any CEA A, there is a CEA B equivalent to SEL(A). Furthermore, the size of B is, w.r.t. the size of A, at most linear for some strategies, and at most exponential for the others.

At first this result might seem unintuitive, especially for the more involved strategies. It is not immediate (and rather involved) to show that there exists a CEA for these strategies, because they need to track an unbounded number of complex events using finite memory. Still, this can be done with an exponential blow-up in the number of states.

Theorem 4 concludes our study of the compilation of unary CEL into CEA. We have shown not only that CEA are able to evaluate CEL formulas, but also that they can be further exploited to evaluate selection strategies. We finish by introducing the notion of I/O-determinism, which will be crucial for our evaluation algorithms in the next section.

I/O-deterministic CEA. To evaluate CEA in practice we will focus on the class of so-called I/O-deterministic CEA (for Input/Output deterministic). We say that a CEA is I/O-deterministic if it has a single initial state and, for any two transitions (q, P1, m, q1) and (q, P2, m, q2) leaving the same state with the same mark symbol, either P1 and P2 are mutually exclusive (i.e. P1 ∧ P2 is unsatisfiable) or q1 = q2. Intuitively, this notion imposes that, given a stream S and a complex event C, there is at most one run over S that generates C (thus the name referencing the input and the output). In contrast, the classical notion of determinism would require that there is at most one run over the entire stream.
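True predicate disjointness is a logical (satisfiability) condition, but it can be approximated by testing the predicates on a finite sample of tuples. The sketch below assumes the condition applies to transitions leaving the same state with the same mark/unmark symbol, which must be predicate-disjoint unless they reach the same state:

```python
from itertools import combinations

def io_deterministic_on(transitions, sample):
    """Approximate I/O-determinism check (sketch): verify, over a finite
    `sample` of tuples only, that no two transitions from the same state
    with the same mark symbol and different targets can fire on the same
    tuple. transitions: list of (p, pred, mark, q)."""
    for (p1, f1, m1, q1), (p2, f2, m2, q2) in combinations(transitions, 2):
        if p1 == p2 and m1 == m2 and q1 != q2:
            if any(f1(t) and f2(t) for t in sample):
                return False  # a witness tuple satisfies both predicates
    return True
```

A `False` answer is always conclusive (a witness tuple was found); a `True` answer only holds relative to the sample.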

I/O-deterministic CEA are important because they allow for a simple and efficient evaluation algorithm (discussed in Sections 7 and 8). But for this algorithm to be useful, we need to make sure that every CEA can be I/O-determinized. Formally, we say that two CEA A1 and A2 are equivalent (denoted A1 ≡ A2) if ⟦A1⟧(S) = ⟦A2⟧(S) for every stream S. Then we say that CEA are closed under I/O-determinism if for every CEA A there is an I/O-deterministic CEA A′ such that A ≡ A′.

###### Proposition 1

CEA are closed under I/O-determinism.

This result and the compilation process allow us to evaluate CEL formulas by means of I/O-deterministic CEA without loss of generality. In the next section we present an algorithm to perform this evaluation efficiently.

## 7 Algorithms for evaluating CEA

In this section we show how to efficiently evaluate a complex event automaton (CEA). We first formalize the notion of an efficient evaluation in the context of CEP and then provide algorithms to evaluate CEA efficiently.

### 7.1 Efficiency in CEP

Defining a notion of efficiency for CEP is challenging, since we would like to compute complex events in one pass while using a restricted amount of resources. Streaming algorithms [38, 32] are a natural starting point, as they usually restrict the time allowed to process each tuple and the space needed to process the first n items of a stream (e.g., constant or logarithmic in n). However, an important difference is that in CEP the arrival of a single event might generate an exponential number of complex events as output. Therefore, no algorithm producing this output could guarantee any sort of efficiency, because there are examples in which merely generating the outputs takes time exponential in the size of the processed sub-stream. To overcome this problem, we propose to divide the evaluation in two parts: (1) consuming new events and updating the internal memory of the system, and (2) generating complex events from the internal memory of the system. We require both parts to be as efficient as possible. First, (1) should process each event in a time that does not depend on the number of events seen in the past. Second, (2) should not spend any time processing and instead should be completely devoted to generating the output. To formalize this notion, we assume that there is a special instruction that returns the next element of a stream S. Then, given a function f, a CEP evaluation algorithm with f-update time is an algorithm that evaluates a CEA A over a stream S such that:

1. between any two calls to the next-element instruction, the time spent is bounded by f(t), where t is the tuple returned by the first of such calls, and

2. maintains a data structure D in memory, such that after calling the next-element instruction n times, the set ⟦A⟧(S, n) can be enumerated from D with constant delay.

The notion of constant-delay enumeration was defined in the database community [50, 18] precisely for capturing efficiency whenever the output might be larger than the input. Formally, it requires the existence of a routine Enumerate that receives D as input and outputs all complex events in ⟦A⟧(S, n) without repetitions, while spending a constant amount of time before and after each output. Naturally, the time to generate a complex event C must be linear in |C|. We remark that (1) is a natural restriction imposed in the streaming literature [38], while (2) is the minimum requirement if an arbitrarily large set of arbitrarily large outputs must be produced [50].

Note that the update time is linear in the size of each tuple if we consider the automaton A to be fixed. Since this is the case in practice (i.e. the automaton is generally small with respect to the stream and does not change during evaluation), this amounts to constant update time when measured under data complexity (tuples can also be considered of constant size).

### 7.2 Evaluation of I/O-deterministic CEA

We describe a CEP evaluation algorithm with update time for I/O-deterministic CEA. We define the algorithm’s underlying data structure, then show how to update this data structure upon new events, and finally how to enumerate the resulting complex events with constant delay.

Data structure. The atomic element in our data structure is the node. A node is defined as a pair , where represents a position in the stream and is a list of nodes. A node is initialized by calling , and the methods and return and , respectively.

The data structure maintained by our algorithm is composed of linked lists of nodes. For operating a linked list we use three methods: one to add a node at the beginning of a list, one to append a list at the end of another, and one to produce a lazy copy. An important property of the data structure is that no element is ever removed from the lists; only adding nodes or appending lists is allowed. This allows us to represent a list as a pair of pointers to its starting node and its ending node. A lazy copy then simply duplicates this pair of pointers, so it trivially runs in constant time, and the generated copy of the list is not affected by future changes on the original. The methods used for navigating a list give a pointer to its first node and return the next element of the list (or a special value when the end is reached).
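A minimal sketch of such an append-only list with constant-time lazy copy is shown below (class and method names are ours, for illustration). The key invariant is that a copy records both the start and end pointers, so traversal of the copy stops at the recorded end and is unaffected by later prepends or appends on the original:

```python
class LNode:
    """A cell of the append-only linked list."""
    __slots__ = ('value', 'next')
    def __init__(self, value):
        self.value = value
        self.next = None

class LazyList:
    """Append-only list represented as (start, end) pointers, with
    O(1) lazy copy: nodes are shared and never removed."""
    def __init__(self):
        self.start = None
        self.end = None

    def add(self, value):
        """Prepend a value in O(1)."""
        n = LNode(value)
        n.next = self.start
        self.start = n
        if self.end is None:
            self.end = n

    def append(self, other):
        """Append another list at the end in O(1) (nodes are shared)."""
        if other.start is None:
            return
        if self.start is None:
            self.start, self.end = other.start, other.end
        else:
            self.end.next = other.start
            self.end = other.end

    def lazycopy(self):
        """O(1) copy: duplicate only the (start, end) pointer pair."""
        c = LazyList()
        c.start, c.end = self.start, self.end
        return c

    def __iter__(self):
        """Traverse from start up to the recorded end node."""
        n = self.start
        while n is not None:
            yield n.value
            if n is self.end:
                break
            n = n.next
```

After `c = l.lazycopy()`, prepending to `l` changes only `l.start`, and appending to `l` links new nodes after `c`'s recorded end, so iterating `c` still yields exactly the elements it had at copy time.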

##### Evaluation

The CEP evaluation algorithm for an I/O-deterministic CEA is given in Algorithms 1 and 2. To ease the notation, we view the transition relation as a function that, given a state and a tuple, retrieves the (unique) state reached by a marking transition whose predicate the tuple satisfies, or a special null value if there is no such transition. Basically, if a run is in state q, this function returns the state the run moves to when reading and marking the current tuple.

The procedure Evaluate keeps the evaluation of the CEA by simulating all its possible runs, and has a list for each state to keep track of the complex events. Intuitively, the list of a state keeps the information of the partial complex events generated by the partial runs currently ending at that state. Each node in the list represents (through its predecessor list) a subset of these complex events, all of them having the node’s position as their last position. These sets are pairwise disjoint (which is an important property for constant-delay enumeration of the output). Each list is initialized as the empty list, except for the list of the initial state, which begins with only the sink node in it. The algorithm then reads the stream to get each new event. For each new event t, the procedure updates the data structure as follows. It starts by creating a copy of each list (lines 7-8). Then, for each state with a non-empty list, it extends the runs currently at that state by simulating the possible outgoing transitions satisfied by t (lines 9-13). After doing this for all states, it calls the Enumerate procedure to enumerate all output complex events generated by the new event.

The core processing of Algorithm 1 is in updating the structure by extending the runs currently at each state (lines 10-13). Specifically, line 11 considers the marking transition and line 13 the skipping transition (recall that the CEA is I/O-deterministic). As said before, the list of a state represents the complex events of runs currently at that state. To extend these runs with a marking transition, line 11 creates a new node with the current position in the stream as its position and the old value of the list as its predecessor list; the new node is then added at the top of the new list of the target state. On the other hand, to extend the runs with a skipping transition, it only needs to append the old list to the list of the target state (line 13).

By looking at Algorithm 1, one can see that the update of each list takes constant time per transition, and therefore the whole update procedure runs in time linear in the size of the automaton. This, added to the constant-time lazy copying of the lists, gives us an overall bound on the time between each pair of consecutive events, satisfying condition (1).

Enumeration. One can consider the data structure maintained by Evaluate as a directed acyclic graph: vertices are nodes and there is an outgoing edge from node to node if appears in . By following Algorithm 1, one can easily check that the sink node is reachable from every node in this directed acyclic graph, namely, for any and any node in there exists a path . Furthermore, each of this path represents a complex event outputted by some run of over that ends at .

Given the previous discussion, the Enumerate procedure in Algorithm 2 is straightforward: it simply traverses the directed acyclic graph in a depth-first manner, computing a complex event for each path. To ensure that all outputs are enumerated, it needs to do this for each node that lies in the list of an accepting state and whose position is equal to the current position. Because new nodes are added on top, it iterates over each accepting list from the beginning, stopping whenever it finds a node with a position different from the current one.
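A toy end-to-end sketch of the update and enumeration steps follows. The encoding is hypothetical (our own), plain Python lists stand in for the O(1) lazy-copy lists, and enumeration is a plain DFS rather than the constant-delay version:

```python
class Node:
    """DAG node: a stream position plus its predecessor list."""
    def __init__(self, pos, preds):
        self.pos, self.preds = pos, preds

SINK = Node(None, [])  # every root-to-sink path spells a complex event

def evaluate(cea, stream):
    """Sketch of Algorithm 1 plus enumeration for an I/O-deterministic
    CEA. Assumed encoding: cea is a dict with 'states', 'init' (single
    initial state), 'final' (set of states), 'delta'(q, t) -> marking
    successor or None, and 'iota'(q, t) -> skipping successor or None."""
    lists = {q: [] for q in cea['states']}
    lists[cea['init']] = [SINK]
    results = []
    for i, t in enumerate(stream):
        old = {q: list(ns) for q, ns in lists.items()}  # real impl: lazycopy
        new = {q: [] for q in cea['states']}
        for q, nodes in old.items():
            if not nodes:
                continue
            p = cea['delta'](q, t)           # marking transition
            if p is not None:
                new[p].insert(0, Node(i, nodes))
            r = cea['iota'](q, t)            # skipping transition
            if r is not None:
                new[r].extend(nodes)
        lists = new
        for q in cea['final']:               # outputs ending at position i
            for n in lists[q]:
                if n.pos == i:
                    collect(n, [], results)
    return results

def collect(node, acc, results):
    """DFS towards the sink; each path spells one complex event."""
    if node is SINK:
        results.append(sorted(acc))
        return
    for pred in node.preds:
        collect(pred, acc + [node.pos], results)
```

On a stream with temperatures 10, 50, 60 and an automaton that skips on the initial state and marks one tuple above 40, the outputs are the singleton complex events for positions 1 and 2, each produced exactly once.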

It is important to note that Enumerate does not satisfy condition (2) of a CEP evaluation algorithm, namely, taking a constant delay between two outputs. The problem lies in the depth-first traversal of the acyclic graph: there can be an unbounded number of backtracking steps, creating a delay between outputs that is not constant. To solve this, one can use a stack with a smart policy that avoids these unbounded backtracking steps. Given space restrictions, we present this modification of Algorithm 2 in the appendix.

### 7.3 CEA and selection strategies

Given that any CEA can be I/O-determinized (Proposition 1), we can use Algorithms 1 and 2 to evaluate any CEA. Unfortunately, the determinization procedure has an exponential blow-up in the size of the automaton.

###### Theorem 5

For every CEA A, there is a CEP evaluation algorithm whose update time is at most exponential in the size of A.

We can further extend the CEP evaluation algorithm for I/O-deterministic CEA to any selection strategy by using the results of Theorem 4. However, naively applying Theorem 4 and then I/O-determinizing the resulting automaton yields a double exponential blow-up in the update time. By doing the compilation of the selection strategies and the I/O-determinization together, we can lower the update time. Moreover, and rather surprisingly, we can evaluate NXT and LAST without determinizing the automaton, and therefore with linear update time.

###### Theorem 6

Let SEL be a selection strategy. For any CEA A, there is a CEP evaluation algorithm for SEL(A). Furthermore, the update time is linear in the size of A if SEL is NXT or LAST, and at most exponential for the remaining strategies.