Hypothetical answers to continuous queries over data streams

05/23/2019
by   Luís Cruz-Filipe, et al.

Continuous queries over data streams may suffer from blocking operations and/or unbound wait, which may delay answers until some relevant input arrives through the data stream. These delays may render answers, when they arrive, obsolete to users, who sometimes have to make decisions with no help whatsoever. Therefore, it can be useful to provide hypothetical answers - "given the current information, it is possible that X will become true at time t" - instead of no information at all. In this paper we present a semantics for queries and corresponding answers that covers such hypothetical answers, together with an online algorithm for updating the set of facts that are consistent with the currently available information.


1 Introduction

Modern-day reasoning systems often have to react to real-time information about the real world provided by e.g. sensors. This information is typically conceptualized as a data stream, which is accessed by the reasoning system. The reasoning tasks associated with data streams – usually called continuous queries – are expected to run continuously and produce results through another data stream in an online fashion, as new elements arrive.

A data stream is a potentially unbounded sequence of data items generated by an active, uncontrolled data source. Elements arrive continuously at the system, potentially unordered, and at unpredictable rates. Thus, reasoning over data streams requires dealing with incomplete or missing data, potentially storing large amounts of data (in case it might be needed to answer future queries), and providing answers in a timely fashion – among other problems, see e.g. [3, 25, 11].

The output stream is normally ordered by time, which implies that the system may have to delay appending some answer because of uncertainty in possible answers relating to earlier time points. The length of this delay may be unpredictable (unbound wait) or infinite, for example if the query uses operators that range over the whole input data stream (blocking operations). In these cases, answers that have been computed may never be output. An approach to avoid this problem is to restrict the language by forbidding blocking operations [26, 23]. Another approach uses the concept of reasoning window [7, 21], which bounds the size of the input that can be used for computing each output (either in time units or in number of events).

In several applications, it is useful to know that some answers are likely to be produced in the future, since there is already some information that might lead to their generation. This is the case, for instance, in prognosis systems (e.g., medical diagnosis, stock market prediction), where one can prepare for the possibility of something happening. To this end, we propose hypothetical answers: answers that are supported by information provided by the input stream, but that still depend on other facts being true in the future. Knowledge about both the facts that support the answer and possible future facts that may make it true gives users the possibility to make timely, informed decisions in contexts where preemptive measures may have to be taken.

Moreover, by giving such hypothetical answers to the user we cope with unbound wait in a constructive way, since the system is no longer “mute” while waiting for an answer to become definitive.

Many existing approaches to reasoning with data streams adapt and extend models, languages and techniques used for querying databases and the semantic web [2, 4]. We develop our theory in line with the works of [26, 7, 10, 21, 24], where continuous queries are treated as rules of a logic program that reasons over facts arriving through a data stream.

Contribution.

We present a declarative semantics for queries in Temporal Datalog [24], where we define the notions of hypothetical and supported answers. We also define an operational semantics based on SLD-resolution, and show that there is a natural connection between the answers computed by this semantics and hypothetical and supported answers. Finally, we refine SLD-resolution to obtain an online algorithm for maintaining and updating the set of answers that are consistent with the currently available information.

Structure.

Section 2 revisits some fundamental background notions, namely the formalism from [24], which we extend in this paper, and introduces the running example that we use throughout this article. Section 3 introduces our declarative semantics for continuous queries, defining hypothetical and supported answers, and relates these concepts with the standard definitions of answers. Section 4 presents our operational semantics for continuous queries and relates it to the declarative semantics. Section 5 details our online algorithm to compute supported answers incrementally, as input facts arrive through the data stream, and proves it sound and complete. Section 6 briefly compares our proposal to similar ones in the literature, and Section 7 concludes and presents further work.

2 Background

In this section we review the most relevant concepts for our work.

2.1 Continuous queries in Temporal Datalog

We use the framework from [24] to write continuous queries over data streams, slightly adapting some definitions. We work in Temporal Datalog, the fragment of negation-free Datalog extended with the special temporal sort from [9], which is isomorphic to the set of natural numbers equipped with addition with arbitrary constants.

Syntax of Temporal Datalog.

A vocabulary consists of constants (numbers or identifiers in lowercase), variables (single uppercase letters) and predicate symbols (identifiers beginning with an uppercase letter). All these may be indexed if necessary; occurrences of predicates and variables are distinguished by context. In examples, we use words in sans serif for concrete constants and predicates.

Constants and variables have one of two sorts: object or temporal. An object term is either an object (constant) or an object variable. A time term is either a natural number (called a time point or temporal constant), a time variable, or an expression of the form $T + k$ where $T$ is a time variable and $k$ is an integer.

Predicates can take at most one temporal parameter, which we assume to be the last one (if present). A predicate with no temporal parameters is called rigid, otherwise it is called temporal. An atom is an expression $P(t_1, \ldots, t_n)$ where $P$ is a predicate and each $t_i$ is a term of the expected sort.

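A rule has the form $\alpha_1 \wedge \ldots \wedge \alpha_n \to \alpha$, where $\alpha$ and each $\alpha_i$ are rigid or temporal atoms. Atom $\alpha$ is called the head of the rule, and $\alpha_1 \wedge \ldots \wedge \alpha_n$ the body. Rules are assumed to be safe: each variable in the head must occur in the body. A program is a set of rules.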

A predicate symbol that occurs in an atom in the head of a rule with non-empty body is called intensional (IDB predicate). Predicates that are defined only through rules with empty body are called extensional (EDB predicates). An atom is extensional (EDB atom) or intensional (IDB atom) according to whether its predicate is extensional or intensional.

A term, atom, rule, or program is ground if it contains no variables. We write for the set of variables occurring in an atom, and extend this function homomorphically to rules and sets. A fact is a function-free ground atom; since Temporal Datalog does not allow function symbols except in temporal terms, every ground rigid atom is a fact.

Rules are instantiated by means of substitutions, which are functions mapping variables to terms of the expected sort. The support of a substitution is the set . We consider only substitutions with finite support, and write for the substitution mapping each variable to the term , and leaving all remaining variables unchanged. A substitution is ground if every variable in its support is mapped to a constant. An instance of a rule is obtained by simultaneously replacing every variable in by and computing any additions of temporal constants.
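As a rough illustration of how an instance is computed, the following Python sketch applies a substitution to an atom and evaluates additions of temporal constants. The tuple encoding of terms, the helper names `is_var`, `shift`, `apply_term` and `apply_atom`, and the predicate `Warn` are assumptions made for this sketch only; they are not notation from the paper.

```python
# Hypothetical encoding, chosen only for this sketch:
#   variable        -> str starting with an uppercase letter ("X", "T")
#   object constant -> lowercase str ("wt25")
#   time point      -> int
#   time expression -> (variable, offset), e.g. ("T", 1) for T + 1
# An atom is a tuple (predicate, term_1, ..., term_n).

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def shift(base, k):
    """Add the integer k to a time term (time point, variable or expression)."""
    if isinstance(base, int):
        return base + k
    if isinstance(base, tuple):
        var, j = base
        return var if j + k == 0 else (var, j + k)
    return base if k == 0 else (base, k)

def apply_term(t, theta):
    if is_var(t):
        return theta.get(t, t)
    if isinstance(t, tuple):
        var, k = t
        return shift(theta.get(var, var), k)
    return t

def apply_atom(atom, theta):
    # The predicate symbol at index 0 is left untouched.
    return (atom[0],) + tuple(apply_term(t, theta) for t in atom[1:])

atom = ("Warn", "X", ("T", 1))
print(apply_atom(atom, {"X": "wt25", "T": 0}))   # a ground instance
print(apply_atom(atom, {"T": ("S", 2)}))         # a non-ground instance
```

The sketch does not check sorts, so it would happily bind a time variable to an object constant; a fuller version would need to track the sort of each argument position.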

A query is a pair where is a program and is an IDB atom in the language underlying . Query is temporal (respectively, rigid) if the predicate in is a temporal (resp. rigid) predicate. (Note that we do not require to be ground.)

A dataset is a set of EDB facts (input facts), intuitively produced by a data stream. For each dataset and time point , we consider ’s -history: the dataset of the facts produced by whose temporal argument is at most . By convention, .
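The notion of history can be sketched in a few lines of Python. The encoding of facts as tuples whose last component is the timestamp, the function name `history` and the predicate `Reading` are assumptions of this sketch, not part of the formalism.

```python
def history(dataset, tau):
    """Facts of `dataset` whose temporal argument (last position) is at most tau."""
    return {fact for fact in dataset if fact[-1] <= tau}

# Illustrative facts; predicate and constant names are hypothetical.
D = {
    ("Reading", "wt25", "high", 0),
    ("Reading", "wt25", "high", 1),
    ("Reading", "wt12", "high", 3),
}

for tau in (-1, 0, 1, 3):
    print(tau, sorted(history(D, tau)))
```

Because timestamps are natural numbers, the history at time -1 is empty without any special case, which matches the convention stated above.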

Semantics.

The semantics of Temporal Datalog is a variant of the standard semantics based on Herbrand models. A Herbrand interpretation for Temporal Datalog is a set of facts. If is an atom with no variables, then we define as the fact obtained from by evaluating each temporal term. In particular, if is rigid, then . We say that satisfies , , if . The extension of the notion of satisfaction to the whole language follows the standard construction, and the definition of entailment is the standard one.

An answer to a query over a dataset is a ground substitution whose domain is the set of variables in , satisfying . In the context of continuous query answering, we are interested in the case where is a -history of some data stream, which changes with time. We denote the set of all answers to over as .

We use a subset of Example 1 in [24] as running example throughout our paper.

Example 1

A set of wind turbines are scattered throughout the North Sea. Each turbine has a sensor that sends temperature readings to a data centre. The data centre tracks activation of cooling measures in each turbine, recording malfunctions and shutdowns by means of the following program .

Consider the query . If the history consists of the single fact , then at time instant there is no output for . If arrives to , then , and there still is no answer to . Finally, the arrival of to yields , allowing us to infer . Then .

Throughout this work, we do not distinguish between the temporal argument in a fact (corresponding to the time point where it is produced) and the instant when it arrives in . In other words, we assume that at each time point , the -history contains all EDB facts about time instants .

2.2 SLD-resolution

We also review some concepts from SLD-resolution.

A literal is an atom or its negation. Atoms are also called positive literals, and a negated atom is a negative literal. A definite clause is a disjunction of literals containing at most one positive literal. In the case where all literals are negative, the clause is a goal. We use the standard rule notation for writing definite clauses.

Definition 1

Given two substitutions and , their composition is obtained from by (i) deleting any binding where and (ii) deleting any binding where .

For every atom , .

Definition 2

Two atomic formulas and are unifiable if there exists a substitution such that .

A unifier of and is called a most general unifier (mgu) if for each unifier of and there exists a substitution such that .

It is well known that there always exist several mgus of any two unifiable atoms, and that they are unique up to renaming of variables.

Recall that a goal is a clause of the form . If is a rule , is a goal with , and is an mgu of and , then the resolvent of and is the goal .

If is a program and is a goal, an SLD-derivation of is a (finite or infinite) sequence of goals with , a sequence of -renamings of program clauses of and a sequence of substitutions such that is the resolvent of and using . A finite SLD-derivation of where the last goal is a contradiction () is called an SLD-refutation of of length , and the substitution obtained by restricting the composition of to the variables occurring in is called a computed answer of .

3 Hypothetical answers

In our running example, being produced at time instant yields some evidence that may turn out to be true. At time instant , we may receive further evidence as in the example (the arrival of ), or we might find out that this fact will not be true (if does not arrive).

We propose a theory where such hypothetical answers to a continuous query are output: if some substitution can become an answer as long as some facts in the future are true, then we output this information. In this way we can lessen the negative effects of unbound wait. Hypothetical answers can also refer to future time points: in our example, would also be output at time point 0 as a substitution that may prove to be an answer to the query when further information arrives.

Our formalism uses ideas from multi-valued logic, where some substitutions correspond to answers (true), others are known not to be answers (false), and others are consistent with the available data, but cannot yet be shown to be true or false. In our example, the fact is consistent with the data at time point 0, and thus “possible”; it is also consistent with the data at time point 1, and thus “more possible”; and it finally becomes (known to be) true at time point 2.

As already motivated, we want answers to give us not only the substitutions that make the query goal true, but also ones that make the query goal possible in the following sense: they depend both on past and future facts, and the past facts are already known.

For the remainder of the article, we assume fixed a query , a data stream and a time instant .

Definition 3

A hypothetical answer to query over is a pair , where is a substitution and is a finite set of ground EDB temporal atoms (the hypotheses) such that:

  • ;

  • only contains atoms with time stamp ;

  • ;

  • is minimal with respect to set inclusion.

is the set of hypothetical answers to over .

Intuitively, a hypothetical answer states that holds if all facts in are ever produced by the data stream. Thus, is currently backed up by the information available. In particular, if then is an answer in the standard sense (it is a known fact).

Proposition 1

If , then .

Proof. if . When , this reduces to , which coincides with the definition of answer.

We can generalize this proposition, formalizing the intuition we gave for the definition of hypothetical answer.

Proposition 2

If , then there exist a time point and a data stream such that and .

Proof. Let be the data stream and be the highest timestamp occurring in . It is straightforward to verify that satisfies the thesis.

Example 2

We illustrate these concepts in the context of Example 1. Consider the substitution . Then , where

Since includes the additional fact , we also have with . Finally, . This answer has no hypotheses, and indeed .

Take for another constant . Then also e.g. with , but since there is no element for .

Hypothetical answers where can be further split into two kinds: those that are supported by some present or past true fact(s), and those for which there is no evidence whatsoever – they only depend on future, unknown facts. For the former, : they rely on some fact from . This is the class of answers that interests us, as there is non-trivial information in saying that they may become true.

Definition 4

A non-empty set of facts is evidence supporting if is a minimal set satisfying . A supported answer to over is a triple such that is evidence supporting .

is the set of supported answers to over .

Since set inclusion is well-founded, if and , then there exists a set such that is a supported answer to over . However, in general, several such sets may exist. As a consequence, Propositions 1 and 2 generalize to supported answers in the obvious way.

Example 3

Consider the hypothetical answers from Example 2. The hypothetical answer is supported by the evidence

while is supported by

However, there is no evidence for , so this answer is not supported.

This example illustrates that unsupported hypothetical answers are not very informative: it is the existence of supporting evidence that distinguishes interesting hypothetical answers from any arbitrary future fact.

However, it is useful to consider even unsupported hypothetical answers in order to develop incremental algorithms to compute supported answers: the sequence of sets is non-monotonic, as at every time point new unsupported hypothetical answers may get evidence and supported hypothetical answers may get rejected. The sequence , on the other hand, is anti-monotonic, as the following results show.

Proposition 3

If , then there exists such that and . Furthermore, if , then .

Proof. Recall that by convention. If , then . Since is finite and set inclusion is well-founded, there is a minimal subset of with the property that and . Clearly .

Assume that is also such that . Then ; but , so and therefore also . By definition of , this implies that , hence .

Finally, if is evidence supporting , then , hence , since .

Proposition 4

If and , then there exists such that .

Proof. Just as the proof of the previous proposition, but dividing into and instead of into and .

Examples 2 and 3 also illustrate this property, with hypotheses turning into evidence as time progresses. Since , Proposition 3 is a particular case of Proposition 4.

In the next sections we show how to compute hypothetical answers and the corresponding sets of evidence for a given continuous query.

4 Operational semantics via SLD-resolution

The definitions of hypothetical and supported answers are declarative. We now show how SLD-resolution can be adapted to algorithms that compute these answers. We use standard results about SLD-resolution, see for example [18].

We begin with a simple observation: since the only function symbol in our language is addition of temporal parameters (which is invertible), we can always choose mgus that do not replace variables in the goal with new ones.

Lemma 5

Let be a goal and be a rule such that is unifiable with for some . Then there is an mgu of and such that all variables occurring in also occur in .

Proof. Let be an mgu of and . For each , can either be a variable or a time expression . First, iteratively build a substitution as follows: for , if occurs in but does not and does not yet include a replacement for the variable in , extend with , if is , or , if is .

We now show that is an mgu of and with the desired property. If , then either (i)  is and is for some or (ii) . In case (i), by construction of if occurs in but includes a variable not in , then replaces that variable with a term using only variables in . In case (ii), by construction does not occur in .

To show that is an mgu of and it suffices to observe that is invertible, with

Without loss of generality, we assume that the mgus in the SLD-derivations we consider are chosen to have the property in Lemma 5.

In classical SLD-resolution, derivations must end in the empty clause. We relax this by allowing derivations to end with a goal that only refers to EDB predicates and in which all the temporal terms refer to future instants (possibly after further instantiation). This makes the notion of derivation also dependent on a time parameter.

Definition 5

An atom is a future atom wrt if is a temporal predicate and the time term either contains a temporal variable or is a time instant .
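Definition 5 translates into a short predicate. The Python sketch below assumes that "future" means a time point strictly greater than the current one, consistent with histories containing the facts whose timestamp is at most the current time; the encoding of atoms and the names `is_future`, `temporal_preds`, `Reading` and `Site` are hypothetical.

```python
def is_future(atom, tau, temporal_preds):
    """True if `atom` is a future atom with respect to time tau.

    Assumed encoding: an atom is (predicate, arg_1, ..., arg_n); the time term
    of a temporal atom is its last argument, given as an int (time point),
    a str (time variable) or a pair (variable, offset) for T + k.
    """
    if atom[0] not in temporal_preds:
        return False              # rigid atoms are never future atoms
    time_term = atom[-1]
    if isinstance(time_term, int):
        return time_term > tau    # assumption: strictly later than tau
    return True                   # the time term contains a temporal variable

temporal = {"Reading"}
print(is_future(("Reading", "wt25", "high", 2), 1, temporal))
print(is_future(("Reading", "wt25", "high", 1), 1, temporal))
print(is_future(("Reading", "X", "high", ("T", 1)), 1, temporal))
```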

Definition 6

An SLD-refutation with future premises of over is a finite SLD-derivation of whose last goal only contains future EDB atoms wrt .

If is an SLD-refutation with future premises of over with last goal and is the substitution obtained by restricting the composition of the mgus in to , then is a computed answer with premises to over , denoted .

Example 4

Consider the query from Example 1 and let . There is an SLD-derivation of ending with the goal , which is a future EDB atom with respect to . Thus, with .

Computed answers with premises are the operational counterpart to hypothetical answers, with two caveats. First, a computed answer with premises need not be ground: there may be some universally quantified variables in the last goal. Second, may contain redundant conjuncts, in the sense that they might not be needed to establish the goal. We briefly illustrate these two features.

Example 5

Continuing with our running example, there is also an SLD-derivation of ending with the goal , which only contains future EDB atoms wrt . Thus also .

Example 6

Consider the program

and the query .

Let . There is an SLD-derivation of ending with the goal , which only contains future EDB atoms wrt . Thus

However, atom is redundant, since alone suffices to make an answer to for any .

(Observe that also , but from a different SLD-derivation.)

We now look at the relationship between the operational definition of computed answer with premises and the notion of hypothetical answer. The examples above show that these notions do not precisely correspond. However, we can show that computed answers with premises approximate hypothetical answers and that, conversely, every hypothetical answer is a grounded instance of a computed answer with premises.

Proposition 6 (Soundness)

If and is a ground substitution such that and for every temporal term occurring in , then there is a set such that .

Proof. Assume that there is some SLD-refutation with future premises of over . Then this is an SLD-derivation whose last goal only contains future EDB atoms with respect to . Let be any substitution in the conditions of the hypothesis. Taking , we can extend this SLD-derivation to a (standard) SLD-refutation for , by resolving with each of the in turn. The computed answer is then the restriction of to . By soundness of SLD-resolution, . Since set inclusion is well-founded, we can find a minimal set with the latter property.

Proposition 7 (Completeness)

If , then there exist substitutions and and a finite set of atoms such that , and .

Proof. Suppose . Then . By completeness of SLD-resolution, there exist substitutions and and an SLD-derivation for with computed answer such that .

By minimality of , for each there must exist a step in this SLD-derivation where the current goal is resolved with . Without loss of generality, we can assume that these are the last steps in the derivation (by independence of the computation rule). Let be the derivation consisting only of these steps, and be the original derivation without . Let be the answer computed by and be its last goal, let be the answer computed by , and define . Then:

  • Let ; then occurs in . If occurs in , then by construction . If is a ground term or does not occur in , then trivially since does not change . In either case, .

  • : by construction of , , and since is ground for each , it is also equal to .

The derivation shows that .

All notions introduced in this section depend on the time parameter , and in particular on the history dataset . In the next section, we explore the idea of “organizing” the SLD-derivation in an adequate way to pre-process independently of , so that the computation of (hypothetical) answers can be split into an offline part and a less expensive online part.

5 Incremental computation of hypothetical answers

Proposition 4 states that the set of hypothetical answers evolves as time passes, with hypothetical answers either gaining evidence and becoming query answers or being put aside due to their dependence on facts that turn out not to be true.

In this section, we show how we can use this temporal evolution to compute supported answers incrementally. We start by revisiting SLD-derivations and showing how they can reflect this temporal structure.

Proposition 8

If , then there exist an SLD-refutation with future premises of over computing and a sequence such that:

  • goals are obtained by resolving with clauses from ;

  • for , goals are obtained by resolving with clauses from .

Proof. Straightforward corollary of the independence of the computation rule.

An SLD-refutation with future premises with the property guaranteed by Proposition 8 is called a stratified SLD-refutation with future premises. Since data stream only contains EDB atoms, it also follows that in a stratified SLD-refutation all goals after are always resolved with EDB atoms. Furthermore, each contains only future EDB atoms with respect to . Let be the restriction of the composition of all substitutions in the SLD-derivation up to step to . Then represents all hypothetical answers to over of the form for some ground substitution (cf. Proposition 6).

This yields an online procedure to compute supported answers. In a pre-processing step, we calculate all computed answers with premises to over , and keep the ones with minimal set of formulas. (Note that Proposition 7 guarantees that all minimal sets are generated by this procedure, although some non-minimal sets may also appear as in Example 5.) The online part of the procedure then performs SLD-resolution between each of these sets and the facts produced by the data stream, adding the resulting resolvents to a set of schemata of supported answers (schemata, since they may include variables). (By Proposition 8, if there is at least one resolution step at this stage, then the hypothetical answers represented by these schemata all have evidence, so they are indeed supported.)

In general, the pre-processing step of this procedure may not terminate, as the following example illustrates.

Example 7

Consider the following program , where is an extensional predicate and is an intensional predicate.

If is produced by the data stream, then is true for every .

Thus, for all . The preprocessing step needs to output this infinite set, so it cannot terminate.

We show that for a particular class of queries discussed in [24] the preprocessing step terminates. A query is connected if each rule in contains at most one temporal variable, which occurs in the head whenever it occurs in the body; and it is nonrecursive if the directed graph induced by its dependencies is acyclic.
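Both conditions are syntactic and can be checked mechanically. The Python sketch below is one possible reading of them, under these assumptions: a program is a list of pairs (head, body), the temporal variable of an atom is read off the last argument of a temporal predicate, and the dependency graph has an edge from the predicate of each head to the predicate of each body atom. The function names and the sample rules are hypothetical.

```python
def temporal_var(atom, temporal_preds):
    """Time variable in the last argument of a temporal atom, or None."""
    if atom[0] not in temporal_preds:
        return None
    t = atom[-1]
    if isinstance(t, tuple):          # time expression (variable, offset)
        return t[0]
    return t if isinstance(t, str) else None

def is_connected(program, temporal_preds):
    for head, body in program:
        head_vars = {temporal_var(head, temporal_preds)} - {None}
        body_vars = {temporal_var(a, temporal_preds) for a in body} - {None}
        if len(head_vars | body_vars) > 1 or not body_vars <= head_vars:
            return False
    return True

def is_nonrecursive(program):
    deps = {}
    for head, body in program:
        deps.setdefault(head[0], set()).update(a[0] for a in body)
    state = {}  # predicate -> "open" while on the DFS stack, "done" afterwards

    def acyclic_from(p):
        if state.get(p) == "done":
            return True
        if state.get(p) == "open":
            return False              # back edge: the graph has a cycle
        state[p] = "open"
        ok = all(acyclic_from(q) for q in deps.get(p, ()))
        state[p] = "done"
        return ok

    return all(acyclic_from(p) for p in list(deps))

temporal = {"Reading", "Warn", "Alert", "P", "Q"}
layered = [
    (("Warn", "X", "T"), [("Reading", "X", "high", "T")]),
    (("Alert", "X", ("T", 1)), [("Warn", "X", "T"), ("Warn", "X", ("T", 1))]),
]
looping = [
    (("Q", "X", "T"), [("P", "X", "T")]),
    (("Q", "X", ("T", 1)), [("Q", "X", "T")]),
]
print(is_connected(layered, temporal), is_nonrecursive(layered))
print(is_connected(looping, temporal), is_nonrecursive(looping))
```

The second sample program is connected but recursive, which is the kind of program for which the pre-processing step is not guaranteed to terminate.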

Proposition 9

Let be a nonrecursive and connected query. Then the set of all computed answers with premises to over can be computed in finite time.

Proof. Let be the (only) temporal variable in . Then all SLD-derivations for have a maximum depth: if we associate to each predicate the length of the maximum path in the dependency graph for starting from it and to each goal the sorted sequence of such values for each of its atoms, then each resolution step decreases this sequence with respect to the lexicographic ordering. Since this ordering is well-founded, the SLD-derivation must terminate.

Furthermore, since is finite, there is a finite number of possible descendants for each node. Therefore, the tree containing all possible SLD-derivations for is a finite branching tree with finite height, and by König’s Lemma it is finite.

Since each resolution step terminates (possibly with failure) in finite time, this tree can be built in finite time.

The algorithm implicit in the proof of Proposition 9 can be improved by standard techniques (e.g. by keeping track of generated nodes to avoid duplicates). However, since it is a pre-processing step that is done offline and only once, we do not discuss such optimizations.

By running this algorithm, we can compute a finite set of preconditions for that represents : for each computed answer with premises to over where is minimal, contains an entry where is the subset of the with minimal timestamp (i.e. those elements of whose temporal variable is with minimal ).

Each tuple represents the set of all hypothetical answers as in Proposition 6.

We now show that computing and updating the set can be done efficiently. This set is maintained again as a set of schematic supported answers (i.e. where variables may occur). We continue to assume that is a nonrecursive and connected query.

Proposition 10

The following algorithm computes from and in time polynomial in the size of , and .

  1. For each and each computed answer to , add to . (Observe that all time variables in are instantiated in .)

  2. For each , compute the set of atoms with timestamp . For each computed answer to , add to .

Proof. To show that this algorithm runs in polynomial time in the size of , and , note that the size of every SLD-derivation that needs to be constructed is bounded by the number of atoms in the initial goal, since only contains facts. Furthermore, all unifiers can be constructed in time linear in the size of the formulas involved, since the only function symbol available is addition of temporal terms. Finally, the total number of SLD-derivations that needs to be considered is bounded by the number of elements of .
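The flavour of step 2 can be conveyed by a deliberately simplified Python sketch that handles only ground supported answers, so that resolution against the newly arrived facts reduces to a subset test. This is not the algorithm of Proposition 10 itself: it omits step 1 (creating new answers from the preconditions) and all unification. The triple layout, the function name `advance` and the predicate `Reading` are assumptions of the sketch.

```python
def advance(answers, new_facts, tau_next):
    """Ground special case of step 2: move from time tau to tau_next = tau + 1.

    Each answer is a triple (theta, evidence, hypotheses) where evidence and
    hypotheses are frozensets of ground facts with the timestamp in last position.
    """
    updated = []
    for theta, evidence, hypotheses in answers:
        due = {h for h in hypotheses if h[-1] == tau_next}
        if due <= new_facts:
            # Every hypothesis about tau_next was produced: it becomes evidence.
            updated.append((theta, evidence | due, hypotheses - due))
        # Otherwise a required fact did not arrive and the answer is discarded.
    return updated

theta = {"X": "wt25"}
r0, r1, r2 = (("Reading", "wt25", "high", t) for t in range(3))
S0 = [(theta, frozenset({r0}), frozenset({r1, r2}))]

S1 = advance(S0, {r1}, 1)          # r1 arrives: one hypothesis turns into evidence
print(S1)
print(advance(S1, {r2}, 2))        # r2 arrives: no hypotheses left, a definite answer
print(advance(S1, set(), 2))       # r2 is missing: the answer is dropped
```

Each call touches every stored answer once and performs only set operations on its hypotheses, which is in the spirit of the bound argued in the proof.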

Example 8

We illustrate this mechanism with our running example. The set contains

From , we obtain the substitution from SLD-resolution between and (step 1). Therefore, contains

Next, . This is the only element of with timestamp . By step 2, contains

Furthermore, from we also add (step 1)

to , with .

Next, . This is the only atom with timestamp in the premises of both elements of , so contains (step 2)

and

From we also get (step 1)

with .

If , then the premises for and become unsatisfied, and no new supported answers are generated from . Thus

The following example also illustrates that, by outputting hypothetical answers, we can answer queries earlier than in other formalisms.

Example 9

Suppose that we extend the program in our running example with the following rule (as in Example 2 from [24]).

If , then

Thus, the answer is produced at time point 1, rather than being delayed until it is known whether is an answer.

Proposition 11 (Soundness)

If and instantiates all free variables in , then .

Proof. By induction on , we show that with . If is obtained from an element in and , then this derivation is obtained by composing the derivation for generating the relevant element of with the one for . If is obtained from an element of and , then this derivation is obtained by composing the derivation obtained by induction hypothesis with the one used for deriving .

By applying Proposition 6 to this SLD-derivation, we conclude that . Furthermore, and is evidence for this answer by construction.

Proposition 12 (Completeness)

If , then there exist a substitution and a triple such that , and .

Proof. By Proposition 7,