Assessing the quality of data and performing data cleaning when the data are not up to the expected standards of quality have been and will continue to be common, difficult and costly problems in data management (Batini & Scannapieco, 2006; Eckerson, 2002; Redman, 1998). This is due, among other factors, to the fact that there is no uniform, general definition of quality data. Actually, data quality has several dimensions. Some of them are (Batini & Scannapieco, 2006): (1) Consistency, which refers to the validity and integrity of data representing real-world entities, typically identified with satisfaction of integrity constraints; (2) Currency (or timeliness), which aims to identify the current values of entities represented by tuples in a (possibly stale) database, and to answer queries with the current values; (3) Accuracy, which refers to the closeness of values in a database to the true values for the entities that the data in the database represent; (4) Completeness, which is characterized in terms of the presence/absence of values; and (5) Redundancy, e.g. multiple representations of external entities or of certain aspects thereof. (Cf. also (Jiang et al., 2008; Fan, 2015; Fan & Geerts, 2012) for more on quality dimensions.)
In this work we consider data quality as referring to the degree to which the data fits or fulfills a form of usage (Batini & Scannapieco, 2006), relating our data quality concerns to the production and the use of data. We will elaborate more on this after the motivating example in this introduction.
Independently from the quality dimension we may consider, data quality assessment and data cleaning are context-dependent activities. This is our starting point, and the one leading our research. In more concrete terms, the quality of data has to be assessed with some form of contextual knowledge; and whatever we do with the data in the direction of data cleaning also depends on contextual knowledge. For example, contextual knowledge can tell us if the data we have is incomplete or inconsistent. In the latter case, the context knowledge is provided by explicit semantic constraints.
In order to address contextual data quality issues, we need a formal model of context. In very general terms, the big picture is as follows. A database can be seen as a logical theory, , and a context for it, as another logical theory, , into which is mapped by means of a set, , of logical mappings, as shown in Figure 1. The image of in is , which could be seen as an interpretation of in . [Footnote 1: Interpretations between logical theories have been investigated in mathematical logic (Enderton, 2001, sec. 2.7) and used, e.g. to obtain (un)decidability results (Rabin, 1965).] The contextual theory provides extra knowledge about , as a logical extension of its image . For example, may contain additional semantic constraints on elements of (or their images in ) or extensions of their definitions. In this way, conveys more semantics or meaning about , contributing to making more sense of ’s elements. may also contain data and logical rules that can be used for further processing or using knowledge in . The embedding of into can be achieved via predicates in common or via more complex logical formulas.
In this work, building upon and considerably extending the framework in (Bertossi et al., 2011a, 2016), context-based data quality assessment, quality data extraction and data cleaning on a relational database for a relational schema are approached by creating a context model where is the theory above (it could be expressed as a logical theory (Reiter, 1984)), the theory is a (logical) ontology ; and, considering that we are using theories around data, the mappings can be logical mappings as used in virtual data integration (Lenzerini, 2002) or data exchange (Barcelo, 2009). In this work, the mappings turn out to be quite simple: The ontology contains, among other predicates, nicknames for the predicates in (i.e. copies of them), so that each predicate in is directly mapped to its copy in .
Once the data in is mapped into , i.e. put in context, the extra elements in it can be used to define alternative versions of , in our case, clean or quality versions, , of in terms of data quality. The data quality criteria are imposed within . This may determine a class of possible quality versions of , virtual or material. The existence of several quality versions reflects the uncertainty that emerges from not having only quality data in .
The whole class, , of quality versions of determines or characterizes the quality data in , as the data that are certain with respect to . One way to go in this direction consists in keeping only the data that are found in the intersection of all the instances in . A more relaxed alternative consists in considering as quality data those that are obtained as certain answers to queries posed to , but answered through : The query is posed to each of the instances in (which essentially have the same schema as ), but only those answers that are shared by those instances are considered to be certain (Imielinski & Lipski, 1984). [Footnote 2: Those familiar with database repairs and consistent query answering (Bertossi, 2011b, 2006) will notice that both can be formulated in this general setting. Instance would be the inconsistent database, the ontology would provide the integrity constraints and the specification of repairs, say in answer set programming (Caniupan and Bertossi, 2010), the class would contain the repairs, and the general certain answers would become the consistent answers.] These answers become the quality-answers in our setting.
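The two notions of certainty just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: instances are modelled as sets of tuples, a query is any function from an instance to a set of answer tuples, and the sample rows, temperature values and the names `certain_answers`, `certain_core` and `patient_values` are hypothetical and do not come from the paper's tables.

```python
def certain_core(instances):
    """Tuples found in every instance (the intersection of all quality versions)."""
    instances = [set(inst) for inst in instances]
    if not instances:
        raise ValueError("at least one instance is required")
    return set.intersection(*instances)


def certain_answers(query, instances):
    """Answers to `query` that are shared by all the given instances."""
    instances = list(instances)
    if not instances:
        raise ValueError("at least one instance is required")
    answers = set(query(instances[0]))
    for inst in instances[1:]:
        answers &= set(query(inst))
    return answers


# Hypothetical quality versions with rows (patient, value, nurse).
v1 = {("Tom Waits", "38.2", "Sara"), ("Tom Waits", "38.5", "Helen")}
v2 = {("Tom Waits", "38.2", "Sara")}


def patient_values(instance):
    """Hypothetical query: project each row onto (patient, value)."""
    return {(patient, value) for (patient, value, _nurse) in instance}


shared_rows = certain_core([v1, v2])
shared_answers = certain_answers(patient_values, [v1, v2])
```

Here only the row reported by Sara survives in both versions, so it is the only one that contributes a certain answer. The query-based variant is the more relaxed one because two instances may agree on a projected answer without sharing any full tuple.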
The main question is about the kind of contextual ontologies that are appropriate for our tasks. There are several basic conditions to satisfy. First of all, has to be written in a logical language. As a theory, it has to be expressive enough, but not so expressive that computational problems, such as (quality) data extraction via queries, become intractable, if not impossible. It also has to combine well with relational data. And, as we emphasize and exploit in our work, it has to allow for the representation and use of dimensions of data, i.e. conceptual axes along which data are represented and analyzed. They are the basic elements in multidimensional databases and data warehouses (Jensen et al., 2010), where we usually find time, location, product, as three dimensions that give context to numerical data, e.g. sales. Dimensions are almost essential elements of contexts, in general, and crucial if we want to analyze data from different perspectives or points of view. We use dimensions as (possibly partially ordered) hierarchies of categories. [Footnote 3: Data dimensions were not considered in (Bertossi et al., 2011a, 2016).] For example, the location dimension could have categories, city, province, country, continent, in this hierarchical order of abstraction.
The language of choice for the contextual ontologies will be Datalog± (Calì et al., 2009). As an extension of Datalog, a declarative query language for relational databases (Ceri et al., 1990), it provides declarative extensions of relational data by means of expressive rules and semantic constraints. Certain classes of Datalog± programs have non-trivial expressive power and good computational properties at the same time. One of those good classes is that of weakly-sticky Datalog± (Calì et al., 2012c). Programs in that class allow us to represent a logic-based, relational reconstruction and extension of the Hurtado-Mendelzon multidimensional data model (Hurtado & Mendelzon, 2002; Hurtado et al., 2005), which allows us to bring data dimensions into contexts.
Every contextual ontology contains its multidimensional core ontology, , which is written in Datalog± and represents what we will call the ontological multidimensional data model (OMD model, in short), plus a quality-oriented sub-ontology, , containing extra relational data (shown as instance in Figure 5), Datalog rules, and possibly additional constraints. Both sub-ontologies are application dependent, but follows a relatively fixed format, and contains the dimensional structure and data that extend and supplement the data in the input instance , without any explicit quality concerns in it. The OMD model is interesting per se in that it considerably extends the usual multidimensional data models (more on this later). Ontology contains as main elements definitions of quality predicates that will be used to produce quality versions of the original tables, and to compute quality query answers. Notice that the latter problem becomes a case of ontology-based data access (OBDA), i.e. about indirectly accessing underlying data through queries posed to the interface and elements of an ontology (Poggi et al., 2008).
Example 1.1.
The relational table Temperatures (Table 1) shows body temperatures of patients in an institution. A doctor wants to know “The body temperatures of Tom Waits for August 21 taken around noon with a thermometer of brand ” (as he expected). Possibly a nurse, unaware of this requirement, used a thermometer of brand , storing the data in Temperatures. In this case, not all the temperature measurements in the table are up to the expected quality. However, table Temperatures alone does not discriminate between intended values (those taken with brand ) and the others.
For assessing the quality of the data or for extracting quality data in/from the table Temperatures according to the doctor’s quality requirement, extra contextual information about the thermometers in use may help. In this case, we may have contextual information in the form of a guideline prescribing that: “Nurses in intensive care unit use thermometers of Brand ”. We still cannot combine this guideline with the data in the table. However, if we know that nurses work in wards, and those wards are associated to units, then we may be in position to combine the table with the given contextual information. Actually, as shown in Figure 4, the context contains dimensional data, in categorical relations linked to dimensions.
In it we find two dimensions, , on the left-hand side, and , on the right-hand side. For example, the dimension’s instance is found in Figure 3. In the middle of Figure 4 we find categorical relations (shown as solid tables and initially excluding the two rows shaded in gray at the bottom of the top table). They are associated to categories in the dimensions.
Now we have all the necessary information to discriminate between quality and non-quality entries in Table 1: Nurses appearing in it are associated to wards, as shown in table Shifts; and the wards are associated to units, as shown in Figure 3. Table WorkSchedules may be incomplete, and new -possibly virtual- entries can be produced for it, showing Helen and Sara working for the Standard and Intensive units, resp. (These correspond to the two (potential) extra, shaded tuples in Figure 4.) This is done by upward navigation and data propagation through the dimension hierarchy. At this point we are in position to take advantage of the guideline, inferring that Alan and Sara used thermometers of brand , as expected by the physician.
As expected, in order to do upward navigation and use the guideline, they have to be represented in our multidimensional contextual ontology. Accordingly, the latter contains, in addition to the data in Figure 4, the two rules, for upward data propagation and the guideline, resp.:
Here, is a categorical relation linked to the Time category in the Temporal dimension. It contains the schedules as in relation WorkSchedules, but at the time of the day level, say “14:30 on Feb/08, 2017”, rather than the day level.
Rule (1) states that: “If a nurse has shifts in a ward on a specific day, he/she has a work schedule in the unit of that ward on the same day”. Notice that the use of (1) introduces unknown, actually null, values in attribute Specialization, which is due to the existential variable ranging over the attribute domain. Existential rules of this kind already make us depart from classic Datalog, taking us to Datalog±.
Also notice that in (1) we are making use of the binary dimensional predicate WardUnit that represents in the ontology the child-parent relation between members of the Ward and Unit categories. [Footnote 4: In the predicate, attributes Unit and Day are called categorical attributes, because they take values from categories in dimensions. They are separated by a semi-colon (;) from the non-categorical Nurse and Speciality.]
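The upward propagation performed by rule (1) can be illustrated with a small Python sketch. It follows the verbal reading of the rule given above and is not the paper's formalization: the attribute order `(ward, day, nurse, shift)` for Shifts, the ward identifiers `W1` and `W3`, the shift labels, and the null naming scheme `_:n1, _:n2, ...` are all assumptions made for the example. Only the nurse names, the unit names and the date are taken from the running example.

```python
from itertools import count


def propagate_upward(shifts, ward_unit):
    """Derive WorkSchedules(unit, day; nurse, specialization) tuples.

    For each Shifts(ward, day; nurse, shift) tuple whose ward has a parent
    unit in WardUnit(ward, unit), one WorkSchedules tuple is generated.
    The specialization is unknown, so a fresh labeled null is invented,
    mirroring the existential variable in the rule head.
    """
    fresh = count(1)
    schedules = set()
    for ward, day, nurse, _shift in sorted(shifts):
        for child, unit in sorted(ward_unit):
            if child == ward:
                schedules.add((unit, day, nurse, f"_:n{next(fresh)}"))
    return schedules


# Hypothetical data: ward identifiers and shift labels are invented.
shifts = {
    ("W1", "Aug/21", "Helen", "morning"),
    ("W3", "Aug/21", "Sara", "noon"),
}
ward_unit = {("W1", "Standard"), ("W3", "Intensive")}

work_schedules = propagate_upward(shifts, ward_unit)
```

The two generated tuples place Helen in the Standard unit and Sara in the Intensive unit on Aug/21, each with a distinct null in the last position. A query that only asks for unit, day and nurse is unaffected by those nulls.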
Rule (1) properly belongs to a contextual, multidimensional, core ontology in the sense that it describes properly dimensional information. Now, rule (2), the guideline, could also belong to , but it is less clear that it conveys strictly dimensional information. Actually, in our case we intend to use it for data quality purposes (cf. Example 1.2 below), and as such we will place it in the quality-oriented ontology . In any case, the separation is always application dependent. However, under certain conditions on the contents of , we will be able to guarantee (in Section 4) that the latter has good computational properties.
The contextual ontology can be used to support the specification and extraction of quality data, as shown in Figure 5. A database instance for a relational schema is mapped into for quality data specification and extraction. The ontology contains a copy, , of schema with predicates that are nicknames for those in . The nickname predicates are directly populated with the data in the corresponding relations (tables) in .
In addition to the multidimensional (MD) ontology, , the contextual ontology contains, in ontology , definitions of application-dependent quality predicates , those in . Together with application-dependent, not directly dimensional rules, e.g. capturing guidelines as in Example 1.1, they capture data quality concerns. Figure 5 also shows as a possible extra contextual database, with schema , whose data could be used at the contextual level in combination with the data strictly associated to the multidimensional ontology (cf. Section 5 for more details).
Data originally obtained from is processed through the contextual ontology, producing, possibly virtual, extensions for copies, , of the original predicates in . Predicates are the “quality versions” of predicates . The following example shows the gist.
Example 1.2.
(ex. 1.1 cont.) , the nickname for predicate in the original instance, is defined by the rule:
Furthermore, contains rule (2) as a definition of quality predicate . Now, , the quality-version of predicate , is defined by means of:
The extension of can be computed, and is shown in Table 2. It contains “quality data” from the original relation . The second and the third tuples in are obtained through the fact that Sara was in the intensive care unit on Aug/21, as reported by the last in-gray shaded tuple in WorkSchedules in Figure 4, which was created by upward data propagation with the dimensional ontology.
It is not mandatory to materialize relation . Actually, the doctor’s query:
Due to the simple ontological rules and the use of them in the example above, we obtain a single quality instance. In other cases, we may obtain several of them, and quality query answering amounts to doing certain query answering (QA) on Datalog± ontologies, in particular on the MD ontologies. Query answering on Datalog± ontologies has been investigated in the literature for different classes of Datalog± programs. For some of them, query answering is tractable and there are efficient algorithms. For others, the problem is known to be tractable, but practical algorithms are still needed. For some classes, the problem is known to be intractable. For this reason, it becomes important to characterize the kind of Datalog± ontologies used for the OMD model.
The promising application of the OMD model that we investigate in this work is related to data quality concerns as pertaining to the use and production of data (Batini & Scannapieco, 2006). By this we mean that the available data are not bad or good a priori or prima facie, but their quality depends on how they were created or how they will be used, and this information is obtained from a contextual ontology. This form of data quality has been mentioned in the literature. For example, in (Wang & Strong, 1996) contextual data quality dimensions are described as those quality dimensions that are relevant to the context of data usage. In (Herzog et al., 2009) and (Juran & Godfrey, 1999), quality is characterized as “fitness for use”.
Our motivating example already shows the gist of our approach to this form of data quality: Nothing looks wrong with the data in Table 1 (the data source), but in order to assess the quality of the source’s data or to extract quality data from it, we need to provide additional data and knowledge that do not exist at the source; and they are both provided by the context. From this point of view, we are implicitly addressing a problem of incomplete data, one of the common data quality dimensions (Batini & Scannapieco, 2006). However, this is not the form of explicit incompleteness that we face with null or missing values in a table (Libkin, 2014). (Cf. Section 6.3 for an additional discussion on the data quality dimensions addressed by the OMD model.)
As we pointed out before (cf. Footnote 2), our contextual approach can be used, depending on the elements we introduce in a contextual ontology, to address other data quality concerns, such as inconsistency, redundancy, and the more typical and direct form of incompleteness, say obtaining from the context data values for null or missing values in tables. [Footnote 5: In the case of duplicate records in a data source, the context could contain an answer set program or a Datalog program to enforce matching dependencies for entity resolution (Bahmani et al., 2012).]
In this work we concentrate mostly on the OMD model by itself, but also on its combination and use with quality-oriented ontologies for quality QA. We do not go into data quality assessment, which is also an interesting subject. [Footnote 6: The quality of can be measured in terms of how much departs from (its quality versions in) : . Of course, different distance measures may be used for this purpose (Bertossi et al., 2011a, 2016).] Next, we summarize the main contributions of this work.
(A) We propose and formalize the Ontological Multidimensional Data Model (OMD model), which is based on a relational extension via Datalog± of the HM model for multidimensional data. The OMD model allows for: (a) Categorical relations linked to dimension categories (at any level), which go beyond the bottom-level, numerical fact tables found in data warehouses. (b) Incomplete data (and complete data as usual). (c) A logical integration and simultaneous representation of dimensional data and metadata, the latter by means of semantic dimensional constraints and dimensional rules. (d) Dimensional navigation and data generation, both upwards and downwards (the examples above show only the upward case).
(B) We establish that, under natural assumptions, MD ontologies belong to the class of weakly-sticky (WS) Datalog± programs (Calì et al., 2012c), for which conjunctive QA is tractable (in data). The class of WS programs is an extension of sticky Datalog± (Calì et al., 2012c) and weakly-acyclic programs (Fagin et al., 2005). Actually, WS Datalog± is defined through restrictions on join variables occurring in infinite-rank positions, as introduced in (Fagin et al., 2005).
In this work, we do not provide algorithms for (tractable) QA on weakly-sticky Datalog± programs. However, in (Milani & Bertossi, 2016b) a practical algorithm was proposed, together with a methodology for magic-set-based query optimization.
(C) We analyze the interaction between dimensional constraints and the dimensional rules, and their effect on QA. Most importantly, the combination of constraints that are equality-generating dependencies (egds) and the rules, which are tuple-generating dependencies (tgds) (Calì et al., 2003), may lead to undecidability of QA. Separability (Calì et al., 2012c) is a semantic condition on egds and tgds that guarantees the interaction between them does not harm the tractability of QA. Separability is an application-dependent issue. However, we show that, under reasonable syntactic conditions on egds in MD ontologies, separability holds.
(D) We propose a general ontology-based approach to contextual quality data specification and extraction. The methodology takes advantage of an MD ontology and a process of dimensional navigation and data generation that is triggered by queries about quality data. We show that under natural conditions the elements of the quality-oriented ontology , in the form of additional Datalog± rules and constraints, do not affect the good computational properties of the core MD ontology .
The closest work related to our OMD model can be found in the dimensional relational algebra proposed in (Martinenghi & Torlone, 2014), which is subsumed by the OMD model (Milani, 2017, chap. 4). The contextual and dimensional data representation framework in (Bolchini et al., 2013) is also close to our OMD model in that it uses dimensions for modeling context. However, in their work dimensions are different from the dimensions in the HM data model. Actually, they use the notion of context dimension trees (CDTs) for modeling multidimensional contexts. Section 6.5 includes more details on related work.
This paper is structured as follows. Section 2 contains a review of databases, Datalog±, and the HM data model. Section 3 formalizes the OMD data model. Section 4 analyzes the computational properties of the OMD model. Section 5 extends the OMD model with additional contextual elements for specifying and extracting quality data, and shows how to use the extension for this task. Section 6 discusses additional related work, draws some final conclusions, and includes a discussion of possible extensions of the OMD model. This paper considerably extends results previously reported in (Milani & Bertossi, 2015b).
In this section, we briefly review relational databases and the multidimensional data model.
2.1. Relational Databases
We always start with a relational schema with two disjoint domains: , with possibly infinitely many constants, and , of infinitely many labeled nulls. also contains predicates of fixed finite arities. If is an -ary predicate (i.e. with arguments) and , denotes its -th position. gives rise to a language of first-order (FO) predicate logic with equality (). Variables are usually denoted with , and sequences thereof by . Constants are usually denoted with ; and nulls are denoted with . An atom is of the form , with an -ary predicate and terms, i.e. constants, nulls, or variables. The atom is ground (aka. a tuple) if it contains no variables. An instance for schema is a possibly infinite set of ground atoms; this set is also called an extension for the schema. In particular, the extension of a predicate in an instance , denoted by , is the set of atoms in whose predicate is . A database instance is a finite instance that contains no nulls. The active domain of an instance , denoted , is the set of constants or nulls that appear in atoms of . Instances can be used as interpretation structures for language .
An instance may be closed or incomplete (a.k.a. open or partial). In the former case, one makes the meta-level assumption, the so-called closed-world-assumption (CWA) (Reiter, 1984; Abiteboul et al., 1995), that the only positive ground atoms that are true w.r.t. are those explicitly given as members of . In the latter case, those explicit atoms may form only a proper subset of those positive atoms that could be true w.r.t. . [Footnote 7: In the most common scenario one starts with a (finite) open database instance that is combined with an ontology whose tgds are used to create new tuples. This process may lead to an infinite instance . Hence the distinction between database instances and instances.]
A homomorphism is a structure-preserving mapping, , between two instances and for schema such that: (a) implies , and (b) for every ground atom : if , then .
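A minimal Python check of the two conditions may help fix ideas. It assumes the usual reading of the definition: (a) constants are mapped to themselves, so only labeled nulls may be renamed, and (b) the image of every atom of the first instance belongs to the second. The encoding is an assumption of this sketch: atoms are pairs `(predicate, args)`, nulls are strings starting with `_:`, and the mapping is a dictionary that is the identity on every term it does not mention.

```python
def is_null(term):
    """Assumed encoding: labeled nulls are strings that start with '_:'."""
    return term.startswith("_:")


def is_homomorphism(h, source, target):
    """Check conditions (a) and (b) for mapping `h` from `source` to `target`."""
    for pred, args in source:
        # (a) constants must be mapped to themselves
        if any(not is_null(t) and h.get(t, t) != t for t in args):
            return False
        # (b) the image of each source atom must be an atom of the target
        if (pred, tuple(h.get(t, t) for t in args)) not in target:
            return False
    return True


# Hypothetical instances: the null _:z1 can be mapped to the constant b.
source = {("R", ("a", "_:z1"))}
target = {("R", ("a", "b")), ("R", ("b", "c"))}

maps_null_to_b = is_homomorphism({"_:z1": "b"}, source, target)
moves_constant = is_homomorphism({"a": "b", "_:z1": "c"}, source, target)
```

The second mapping does send the atom into the target, but it renames the constant `a`, so it violates condition (a) and is rejected.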
A conjunctive query (CQ) is an FO formula, , of the form:
with , and (distinct) free variables . If has (free) variables, for an instance , is an answer to if , meaning that becomes true in when the variables in are componentwise replaced by the values in . denotes the set of answers to in . is a boolean conjunctive query (BCQ) when is empty, and if it is true in , in which case . Otherwise, , and we say it is false.
A tuple-generating dependency (tgd), also called a rule, is an implicitly universally quantified sentence of of the form:
with , and , and the dots in the antecedent standing for conjunctions. The variables in (that could be empty) are the existential variables. We assume . With and we denote the atom in the consequent and the set of atoms in the antecedent of , respectively.
A constraint is an equality-generating dependency (egd ) or a negative constraint (nc), which are also sentences of , respectively, of the forms:
with , and , and is a symbol that denotes the Boolean constant (propositional variable) that is always false. Satisfaction of constraints by an instance is as in FO logic. In Section 3 we will use ncs with negated body atoms (i.e. negative literals), in a limited manner. Their semantics is also as in FO logic, i.e. the body cannot be made true in a consistent instance, for any data values for the variables in it.
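For orientation, the three kinds of sentences introduced above are commonly written as shown next. This is a sketch in standard notation and not a reproduction of displays (7)-(9): the symbols $P$, $P_i$, $\bar{x}$, $\bar{x}_i$, $\bar{y}$, $x$, $x'$, the left-to-right direction of the arrow, and the side conditions are assumptions based on the surrounding description (a single head atom, existential variables only in the head, $\bot$ for the always-false constant).

```latex
\begin{align*}
\text{tgd:}\quad & P_1(\bar{x}_1), \ldots, P_n(\bar{x}_n) \;\rightarrow\; \exists \bar{y}\; P(\bar{x}, \bar{y}),
   && \bar{x} \subseteq \textstyle\bigcup_i \bar{x}_i, \\
\text{egd:}\quad & P_1(\bar{x}_1), \ldots, P_n(\bar{x}_n) \;\rightarrow\; x = x',
   && x, x' \in \textstyle\bigcup_i \bar{x}_i, \\
\text{nc:}\quad  & P_1(\bar{x}_1), \ldots, P_n(\bar{x}_n) \;\rightarrow\; \bot .
\end{align*}
```

Under this reading, a key constraint is an egd whose body contains two atoms over the same predicate that agree on the key positions, and an nc simply forbids any assignment that makes its body true.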
Tgds, egds, and ncs are particular kinds of relational integrity constraints (ICs) (Abiteboul et al., 1995). In particular, egds include key constraints and functional dependencies (FDs). ICs also include inclusion dependencies (IDs): For an -ary predicate and an -ary predicate , the ID , with , means that -in the extensions of and in an instance- the values appearing in the th position (attribute) of must also appear in the th position of .
Relational databases work under the CWA, i.e. ground atoms not belonging to a database instance are assumed to be false. As a consequence, an IC is true or false when checked for satisfaction on a (closed) database instance, never undetermined. However, as we will see below, if instances are allowed to be incomplete, i.e. with undetermined or missing ground atoms, ICs may not be false, but only undetermined in relation to their truth status. Actually, they can be used, by enforcing them, to generate new tuples for the (open) instance.
Datalog is a declarative query language for relational databases that is based on the logic programming paradigm, and allows one to define recursive views (Abiteboul et al., 1995; Ceri et al., 1990). A Datalog program for schema is a finite set of non-existential rules, i.e. as in (7) but without ∃-variables. Some of the predicates in are extensional, i.e. they do not appear in rule heads, and their complete extensions are given by a database instance (for a subschema of ), that is called the program’s extensional database. The program’s intensional predicates are those that are defined by the program by appearing in tgds’ heads. The program’s extensional database may give to them only partial extensions (additional tuples for them may be computed by the application of the program’s tgds). However, without loss of generality, it is common with Datalog to make the assumption that intensional predicates do not have an explicit extension, i.e. explicit ground atoms in .
The minimal-model semantics of a Datalog program w.r.t. an extensional database instance is given by a fix-point semantics (Abiteboul et al., 1995): the extensions of the intensional predicates are obtained by, starting from , iteratively enforcing the rules and creating tuples for the intensional predicates, i.e. whenever a ground (or instantiated) rule body becomes true in the extension obtained so far, but not the head, the corresponding ground head atom is added to the extension under computation. If the set of initial ground atoms is finite, the process reaches a fix-point after a finite number of steps. The database instance obtained in this way turns out to be the unique minimal model of the Datalog program: it extends the extensional database , makes all the rules true, and no proper subset has the two previous properties. Notice that the constants in a minimal model of a Datalog program are those already appearing in the active domain of or in the program rules; no new data values of any kind are introduced.
One can pose a CQ to a Datalog program by evaluating it on the minimal model of the program, seen as a database instance. However, it is common to add the query to the program, and the minimal model of the combined program gives us the set of answers to the query. In order to do this, a CQ as in (6) is expressed as a Datalog rule of the form:
where is an auxiliary, answer-collecting predicate. The answers to query form the extension of predicate in the minimal model of the original program extended with the query rule. When is a BCQ, is a propositional atom; and is true in the underlying instance exactly when the atom belongs to the minimal model of the program.
Example 2.1.
A Datalog program containing the rules , and recursively defines, on top of an extension for predicate , the intensional predicate as the transitive closure of . With as the extensional database, the extension of can be computed by iteratively adding tuples enforcing the program rules, which results in the instance , which is the minimal model of the program.
The CQ can be expressed by the rule . The set of answers is the computed extension for on instance , namely . Equivalently, the query rule can be added to the program, and the minimal model of the resulting program will contain the extension for the auxiliary predicate : .
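The fix-point computation and the answer-collecting query rule of this kind of example can be sketched with a naive bottom-up evaluator in Python. The sketch makes several assumptions: the predicate names `edge`, `path` and `ans`, the constants `a`, `b`, `c`, and the convention that variables are strings starting with an uppercase letter are all hypothetical, and rules are assumed to be safe (every head variable occurs in the body). It implements plain naive evaluation, not an optimized strategy such as semi-naive evaluation.

```python
def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def match(atom, fact, subst):
    """Extend `subst` so that `atom` becomes `fact`; return None on failure."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    subst = dict(subst)
    for a, f in zip(args, fargs):
        if is_var(a):
            if subst.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return subst


def body_matches(body, facts):
    """All substitutions that make every body atom a fact."""
    substs = [{}]
    for atom in body:
        substs = [s2 for s in substs for fact in facts
                  if (s2 := match(atom, fact, s)) is not None]
    return substs


def minimal_model(rules, edb):
    """Apply the rules until no new ground atom is produced."""
    facts = set(edb)
    while True:
        new = set()
        for (pred, args), body in rules:
            for s in body_matches(body, facts):
                new.add((pred, tuple(s.get(a, a) for a in args)))
        if new <= facts:
            return facts
        facts |= new


# path is the transitive closure of edge; ans collects the query answers.
rules = [
    (("path", ("X", "Y")), [("edge", ("X", "Y"))]),
    (("path", ("X", "Z")), [("edge", ("X", "Y")), ("path", ("Y", "Z"))]),
    (("ans", ("X",)), [("path", ("a", "X"))]),
]
edb = {("edge", ("a", "b")), ("edge", ("b", "c"))}

model = minimal_model(rules, edb)
answers = {args[0] for (pred, args) in model if pred == "ans"}
```

Adding the query rule to the program, as done with the third rule, means that the answers are read off directly from the extension of `ans` in the minimal model.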
Datalog± is an extension of Datalog. The “+” stands for the extension, and the “−”, for some syntactic restrictions on the program that guarantee some good computational properties. We will refer to some of those restrictions in Section 4. Accordingly, until then we will consider Datalog+ programs.
A Datalog+ program may contain, in addition to (non-existential) Datalog rules, also existential rules of the form (7), and constraints of the forms (8) and (9). A Datalog+ program has an extensional database . In a Datalog+ program , unlike plain Datalog, predicates are not necessarily partitioned into extensional and intensional ones: any predicate may appear in the head of a rule. As a consequence, some predicates may have partial extensions in the extensional database , and their extensions will be completed via rule enforcements.
The semantics of a Datalog+ program with an extensional database instance is model-theoretic, and given by the class of all, possibly infinite, instances for the program’s schema (in particular, with domain contained in ) that extend and make true. Notice that, in contrast to Datalog, the combination of the “open-world assumption” and the use of ∃-variables in rule heads makes us consider possibly infinite models for a Datalog+ program, actually with domains that go beyond the active domain of the extensional database.
If a Datalog+ program has an extensional database instance , a set of tgds, and a set of constraints of the forms (8) or (9), then is consistent if is non-empty, i.e. the program has at least one model.
Given a Datalog+ program with database instance and an -ary CQ , is an answer w.r.t. iff for every , which is equivalent to . Accordingly, this is certain answer semantics. In particular, a BCQ is true w.r.t. if it is true in every . In the rest of this paper, unless otherwise stated, CQs are BCQs, and CQA is the problem of deciding if a BCQ is true w.r.t. a given program. [Footnote 8: For Datalog+ programs, CQ answering, i.e. checking if a tuple is an answer to a CQ query, can be reduced to BCQ answering as shown in (Calì et al., 2013), and they have the same data complexity.]
Without any syntactic restrictions on the program, and even for programs without constraints, conjunctive query answering (CQA) may be undecidable (Beeri & Vardi, 1981). CQA appeals to all possible models of the program. However, the chase procedure (Maier et al., 1979) can be used to generate a single, possibly infinite, instance that represents the whole class of models for this purpose. We show it by means of an example.
Example 2.2.
Consider a program with the set of rules , and , and an extensional database instance , providing an incomplete extension for the program’s schema. With the instance , the pair , with (value) assignment (for variables) , is applicable: . The chase enforces by inserting a new tuple into ( is a fresh null, i.e. not in ), resulting in instance .
Now, , with , is applicable, because . The chase adds into , resulting in . The chase continues, without stopping, creating an infinite instance, usually called the chase (instance): .
For some programs an instance obtained through the chase may be finite. Different orders of chase steps may result in different sequences and instances. However, it is possible to define a canonical chase procedure that determines a canonical sequence of chase steps, and consequently, a canonical chase instance (Calì et al., 2013).
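The chase steps described above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions: the two rules, the fact encoding, the null names `_n1, _n2, ...`, the step bound `max_steps`, and the choice of a restricted chase (an existential rule fires only if its head is not already satisfied) are all hypothetical and are not taken from the canonical chase of (Calì et al., 2013).

```python
from itertools import count, product

# Hypothetical program (an assumption made for illustration only):
#   sigma1: R(x, y) -> exists z. R(y, z)
#   sigma2: R(x, y), R(y, z) -> S(x, y, z)
# Facts are tuples (predicate, arg1, ..., argk); nulls are "_n1", "_n2", ...


def next_step(inst):
    """Return one applicable (rule, arguments) pair, or None if there is none."""
    r_facts = sorted(f for f in inst if f[0] == "R")
    # sigma2 is a full tgd: it never invents nulls, so we try it first.
    for (_, x, y), (_, y2, z) in product(r_facts, repeat=2):
        if y == y2 and ("S", x, y, z) not in inst:
            return "sigma2", (x, y, z)
    # sigma1 fires for R(x, y) only if no fact R(y, _) exists yet.
    for (_, x, y) in r_facts:
        if not any(f[0] == "R" and f[1] == y for f in inst):
            return "sigma1", y
    return None


def chase(db, max_steps):
    """Run at most max_steps chase steps; return (instance, terminated)."""
    inst = set(db)
    fresh = count(1)
    for _ in range(max_steps):
        step = next_step(inst)
        if step is None:
            return inst, True
        rule, args = step
        if rule == "sigma1":
            inst.add(("R", args, f"_n{next(fresh)}"))  # value invention
        else:
            inst.add(("S",) + args)
    return inst, False


if __name__ == "__main__":
    print(chase({("R", "a", "b")}, 4))   # step bound reached, chase not finished
    print(chase({("R", "a", "a")}, 10))  # terminates after one step
```

With the instance containing only `R(a, b)` the sketch keeps inventing nulls and only stops because of the bound, whereas with `R(a, a)` it reaches a finite chase instance, which matches the remark that for some programs and instances the chase is finite.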
Given a program Π and extensional database D, its chase (instance) is a universal model (Fagin et al., 2005): for every model I of Π and D, there is a homomorphism from the chase into I. For this reason, the (certain) answers to a CQ Q under Π and D can be computed by evaluating Q over the chase instance (and discarding the answers containing nulls) (Fagin et al., 2005). Universal models of plain Datalog programs are finite and coincide with the minimal models. However, the universal model of a Datalog+ program may be infinite, namely when the chase procedure does not stop, as shown in Example 2.2. This is a consequence of the OWA underlying Datalog+ programs and the presence of existential variables.
If a program Π consists of a set of tgds Π^R and a set of ncs Π^C, then CQA for a BCQ Q amounts to deciding if D ∪ Π^R ∪ Π^C ⊨ Q. However, this is equivalent to deciding if: (a) D ∪ Π^R ⊨ Q, or (b) for some η ∈ Π^C, D ∪ Π^R ⊨ Q_η, where Q_η is the BCQ obtained as the existential closure of the body of η (Calì et al., 2012c, theo. 6.1). In the latter case, the program is inconsistent, and Q becomes trivially true. This shows that CQA evaluation under ncs can be reduced to the same problem without ncs, and the data complexity of CQA does not change. Furthermore, ncs may have an effect on CQA only if they are mutually inconsistent with the rest of the program, in which case every BCQ becomes trivially true.
If Π has egds, they are expected to be satisfied by a modified (canonical) chase (Calì et al., 2013) that also enforces the egds. This enforcement may become impossible at some point, in which case we say the chase fails (cf. Example 2.3). Notice that consistency of a Datalog+ program is defined independently from the chase procedure, but can be characterized in terms of the chase. Furthermore, if the canonical chase procedure terminates (finitely or by failure), the result can be used to decide if the program is consistent. The next example shows that egds may have an effect on CQA even with consistent programs.
Example 2.3.
Consider a program with two rules and an egd:
The chase of first applies (11) and results in . There are no more tgd/assignment applicable pairs. But, if we enforce the egd (13), equating and , we obtain . Now, (12) and are applicable, so we add to , generating . The chase terminates (no applicable tgds or egds), obtaining .
Notice that the program consisting only of (11) and (12) produces as the chase, which makes the BCQ evaluate to false. With the program also including the egd (13) the answer is now true, which shows that consistent egds may affect CQ answers. This is in line with the use of a modified chase procedure that applies them along with the tgds.
Now consider program that is with the extra rule , which enforced on results in . Now (13) is applied, which creates a chase failure, as it tries to equate two distinct constants. This is a case where the set of tgds and the egd are mutually inconsistent.
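The two effects of egds discussed in this example, enabling new tgd applications and causing a chase failure, can be illustrated with a small Python sketch. The program below is a hypothetical one chosen by us to exhibit the same phenomena, and is not claimed to be the program of the example: the egd S(x, y, z) → y = z, the full tgd S(x, y, y) → P(x, y), the encoding of nulls as strings starting with `_`, and the names `enforce_egd`, `apply_tgd` and `ChaseFailure` are all assumptions.

```python
class ChaseFailure(Exception):
    """Raised when an egd would have to equate two distinct constants."""


def is_null(value):
    # Assumption: labeled nulls are encoded as strings starting with "_".
    return value.startswith("_")


def enforce_egd(inst):
    """Enforce the hypothetical egd S(x, y, z) -> y = z until a fixpoint."""
    inst = set(inst)
    changed = True
    while changed:
        changed = False
        for fact in sorted(inst):
            if fact[0] != "S" or fact[2] == fact[3]:
                continue
            y, z = fact[2], fact[3]
            if not is_null(y) and not is_null(z):
                raise ChaseFailure(f"cannot equate constants {y} and {z}")
            # Replace a null by the other value everywhere in the instance.
            old, new = (y, z) if is_null(y) else (z, y)
            inst = {
                (f[0],) + tuple(new if a == old else a for a in f[1:])
                for f in inst
            }
            changed = True
            break  # the instance changed, so restart the scan
    return inst


def apply_tgd(inst):
    """Apply the hypothetical full tgd S(x, y, y) -> P(x, y) once."""
    new_facts = {("P", f[1], f[2]) for f in inst if f[0] == "S" and f[2] == f[3]}
    return set(inst) | new_facts


if __name__ == "__main__":
    i1 = {("R", "a", "b"), ("S", "b", "_n1", "_n2")}
    print(apply_tgd(i1))               # no P-fact: the tgd is not applicable
    print(apply_tgd(enforce_egd(i1)))  # the egd makes the tgd applicable
```

On `i1` the tgd alone produces no `P`-fact, so a BCQ asking for some `P`-tuple is false; after the egd merges the two nulls the tgd fires and the same BCQ becomes true. Adding a fact such as `S(_n3, a, b)` makes `enforce_egd` raise `ChaseFailure`, which corresponds to a chase failure.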
2.3. The Hurtado-Mendelzon Multidimensional Data Model
According to the Hurtado-Mendelzon multidimensional data model (in short, the HM model) (Hurtado & Mendelzon, 2002), a dimension schema consists of a finite set 𝒦 of categories, and an irreflexive, binary relation ↗, called the child-parent relation, between categories (the first category is a child and the second category is a parent). The transitive and reflexive closure of ↗ is denoted by ↗*, and is a partial order (a lattice) with a top category, All, which is reachable from every other category: K ↗* All, for every category K ∈ 𝒦. There is a unique base category that has no children. There are no “shortcuts”, i.e. if K ↗ K′, there is no category K″, different from K and K′, with K ↗* K″ and K″ ↗* K′.
A dimension instance for a dimension schema is a structure ⟨ℳ, <, δ⟩, where ℳ is a non-empty, finite set of data values called members, < is an irreflexive binary relation between members, also called a child-parent relation (the first member is a child and the second member is a parent), and δ is the total membership function, which maps members to categories. (Footnote 9: There are two child-parent relations in a dimension: ↗, between categories; and <, between category members.) Relation < parallels (is consistent with) the child-parent relation ↗ between categories: m < m′ implies δ(m) ↗ δ(m′). The statement δ(m) = K is also expressed as m ∈ K. <* is the transitive and reflexive closure of <, and is a partial order over the members. There is a unique member all, the only member of All, which is reachable via <* from any other member: m <* all, for every member m. A child member in < has only one parent member in the same category: for members m, m₁, and m₂, if m < m₁, m < m₂, and m₁ and m₂ are in the same category (i.e. δ(m₁) = δ(m₂)), then m₁ = m₂. <* is used to define the roll-up relations for any pair of distinct categories K ↗* K′: the roll-up relation from K to K′ is {(m, m′) | m ∈ K, m′ ∈ K′, and m <* m′}.
Example 2.4.
The HM model in Figure 6 includes three dimension instances: Temporal and Disorder (at the top) and Hospital (at the bottom). They are not shown in full detail, but only their base categories Day, Disease, and Ward, resp. We will use four different dimensions in our running example, the three just mentioned and also Instrument (cf. Example 3.2). For the Hospital dimension, shown in detail in Figure 3, , with base category Ward and top category . The child-parent relation contains , , and . The category of each member is specified by , e.g. . The child-parent relation between members contains , , , , , , , , and . Finally, is one of the roll-up relations and contains , , , and .
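The definitions of a dimension instance and of its roll-up relations can be made concrete with a short Python sketch. The data below is hypothetical: it is only loosely modeled on a hospital dimension, and the category names, member names and the helper names `ancestors`, `parallels_schema`, `is_strict` and `roll_up` are our own assumptions, not a transcription of Figure 3.

```python
# Hypothetical dimension schema: child-parent pairs between categories.
SCHEMA = {("Ward", "Unit"), ("Unit", "Institution"), ("Institution", "AllHospital")}

# Hypothetical membership function: member -> category.
DELTA = {
    "W1": "Ward", "W2": "Ward", "W3": "Ward", "W4": "Ward",
    "Standard": "Unit", "Intensive": "Unit", "Terminal": "Unit",
    "H1": "Institution", "H2": "Institution",
    "allHospital": "AllHospital",
}

# Hypothetical child-parent relation between members.
CHILD_PARENT = {
    ("W1", "Standard"), ("W2", "Standard"), ("W3", "Intensive"),
    ("W4", "Terminal"), ("Standard", "H1"), ("Intensive", "H1"),
    ("Terminal", "H2"), ("H1", "allHospital"), ("H2", "allHospital"),
}


def parallels_schema():
    """Member-level pairs must follow category-level child-parent pairs."""
    return all((DELTA[c], DELTA[p]) in SCHEMA for c, p in CHILD_PARENT)


def is_strict():
    """A member has at most one parent in any given category."""
    seen = {}
    for child, parent in CHILD_PARENT:
        key = (child, DELTA[parent])
        if seen.setdefault(key, parent) != parent:
            return False
    return True


def ancestors(member):
    """Members reachable from `member` via the reflexive-transitive closure."""
    reached, stack = {member}, [member]
    while stack:
        current = stack.pop()
        for child, parent in CHILD_PARENT:
            if child == current and parent not in reached:
                reached.add(parent)
                stack.append(parent)
    return reached


def roll_up(low, high):
    """Roll-up relation between categories `low` and `high`."""
    return {
        (m, a)
        for m in DELTA if DELTA[m] == low
        for a in ancestors(m) if DELTA[a] == high
    }


if __name__ == "__main__":
    print(parallels_schema(), is_strict())
    print(sorted(roll_up("Ward", "Institution")))
```

For this data, `roll_up("Ward", "Institution")` pairs each ward with the single institution it reaches through its unit.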
In the rest of this section we show how to represent an HM model in relational terms. This representation will be used in the rest of this paper, in particular to extend the HM model. We introduce a relational dimension schema ℋ = 𝒦 ∪ ℒ, where 𝒦 is a set of unary category predicates, and ℒ (for “lattice”) is a set of binary child-parent predicates, with the first attribute as the child and the second as the parent. The data domain of the schema is ℳ (the set of category members). Accordingly, a dimension instance is a database instance for ℋ that gives extensions to the predicates in ℋ. The extensions of the category predicates form a partition of ℳ.
In particular, for each category K there is a category predicate K(·) in 𝒦, and the extension of the predicate contains the members of the category. Also, for every pair of categories K, K′ with K ↗ K′, there is a corresponding child-parent predicate, say KK′(·, ·), in ℒ, whose extension contains the child-parent, <-relationships between members of K and K′. In other words, each child-parent predicate in ℒ stands for a roll-up relation between two categories in child-parent relationship.
Example 2.5.
In order to recover the hierarchy of a dimension in its relational representation, we have to impose some ICs. First, inclusion dependencies (IDs) associate the child-parent predicates to the category predicates. For example, the following IDs: , and . We need key constraints for the child-parent predicates, with the first attribute (child) as the key. For example, is the key attribute for , which can be represented as the egd: .
Assume ℋ is the relational schema with multiple dimensions. A fact-table schema over ℋ is a predicate T(C₁, …, Cₙ, M), where C₁, …, Cₙ are attributes with domains ℳ₁, …, ℳₙ for subdimensions ℋ₁, …, ℋₙ, and M is the measure attribute with a numerical domain. Each attribute Cᵢ is associated with a base-category predicate Kᵢ(·) of ℋᵢ through the ID: T[Cᵢ] ⊆ Kᵢ[1]. Additionally, {C₁, …, Cₙ} is a key for T, i.e. each point in the base multidimensional space is mapped to at most one measurement. A fact-table provides an extension (or instance) for T. For example, in the center of Figure 6, the fact table is linked to the base categories of the three participating dimensions through its attributes Ward, Disease, and Day, upon which its measure attribute Count functionally depends.
This multidimensional representation enables aggregation of numerical data at different levels of granularity, i.e. at different levels of the hierarchies of categories. The roll-up relations can be used for aggregation.
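As a minimal sketch of how a roll-up relation supports aggregation, the following Python fragment sums a measure at a coarser granularity. The fact-table rows, the member names, the roll-up mapping from wards to institutions, and the function names `key_holds` and `aggregate` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical roll-up from the base category Ward to the category Institution.
WARD_TO_INSTITUTION = {"W1": "H1", "W2": "H1", "W3": "H1", "W4": "H2"}

# Hypothetical fact-table rows: (ward, disease, day, count).
FACTS = [
    ("W1", "flu", "d1", 3),
    ("W2", "flu", "d1", 2),
    ("W4", "flu", "d1", 5),
    ("W3", "lung cancer", "d1", 1),
]


def key_holds(facts):
    """The dimensional attributes must determine the measure."""
    points = [row[:-1] for row in facts]
    return len(points) == len(set(points))


def aggregate(facts, rollup):
    """Sum the measure after replacing each ward by its roll-up image."""
    totals = defaultdict(int)
    for ward, disease, day, cnt in facts:
        totals[(rollup[ward], disease, day)] += cnt
    return dict(totals)


if __name__ == "__main__":
    print(key_holds(FACTS))
    print(aggregate(FACTS, WARD_TO_INSTITUTION))
```

The same function aggregates at any level for which a roll-up mapping from the base category is available; only the `rollup` argument changes.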
3. The Ontological Multidimensional Data Model
In this section, we present the OMD model as an ontological, Datalog±-based extension of the HM model. We will refer to the working example from Section 1, extending it along the way when necessary to illustrate elements of the OMD model.
An OMD model has a database schema that combines a relational schema with multiple dimensions, with sets of unary category predicates and sets of binary child-parent predicates (cf. Section 2.3), and a set of categorical predicates, whose categorical relations can be seen as extensions of the fact-tables in the HM model.
Attributes of categorical predicates are either categorical, whose values are members of dimension categories, or non-categorical, taking values from arbitrary domains. Categorical predicates are represented in the form R(C₁, …, Cₘ; N₁, …, Nₙ), with categorical attributes (the Cᵢ) all before the semi-colon (“;”), and non-categorical attributes (the Nᵢ) all after it.
The extensional data, i.e. the instance for the schema, is the union of two sub-instances: a complete database instance for the subschema containing the dimensional predicates (i.e. category and child-parent predicates), and a sub-instance that contains possibly partial, incomplete extensions for the categorical predicates.
Every schema for an OMD model comes with some basic, application-independent semantic constraints. We list them next, represented as ICs.
1. Dimensional child-parent predicates must take their values from categories. Accordingly, if a child-parent predicate P is associated to category predicates K and K′, in this order, we introduce the IDs P[1] ⊆ K[1] and P[2] ⊆ K′[1], as ncs:
P(x, x′), ¬K(x) → ⊥, and P(x, x′), ¬K′(x′) → ⊥. (14)
We do not represent them as the tgds P(x, x′) → K(x), etc., because we reserve the use of tgds for predicates (in their right-hand sides) that may be incomplete. This is not the case for the category predicates K or K′, which have complete extensions in every instance. For this same reason, as mentioned right after introducing ncs in (8), we use here ncs with negative literals: they are harmless in the sense that they are checked against complete extensions for predicates that do not appear in rule heads. Then, this form of negation is the simplest case of stratified negation (Abiteboul et al., 1995). (Footnote 10: Datalog± with stratified negation, i.e. negation that is not intertwined with recursion, is considered in (Calì et al., 2013).) Checking any of these constraints amounts to posing a non-conjunctive query to the instance at hand (we return to this issue in Section 4.3).
2. Key constraints on dimensional child-parent predicates P, as egds:
P(x, x₁), P(x, x₂) → x₁ = x₂. (15)
3. The connections between categorical attributes and the category predicates are specified by means of IDs represented as ncs. More precisely, for the i-th categorical position of a categorical predicate R taking values in category K, the ID R[i] ⊆ K[1] is represented by:
R(x̄; ȳ), ¬K(x) → ⊥, (16)
where x is the i-th variable in the list x̄.
Example 3.1.
(ex. 1.1 cont.) The categorical attributes Unit and Day of categorical predicate in are connected to the Hospital and Temporal dimensions, resp., which is captured by the IDs , and . The former is written in Datalog as in (16):
For the Hospital dimension, one of the two IDs for the child-parent predicate is , which is expressed by an nc of the form (14):
The key constraint of WardUnit is captured by an egd of the form (15):
WardUnit(x, y), WardUnit(x, z) → y = z.
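Since the dimensional predicates have complete extensions, constraints of the forms (14) and (15) can be checked directly as queries over the instance. The following Python fragment is a minimal sketch of such a check; the extensions of `WARD`, `UNIT` and `WARD_UNIT` and the function names are hypothetical and used only for illustration.

```python
# Hypothetical complete extensions for two category predicates and one
# child-parent predicate.
WARD = {"W1", "W2", "W3"}
UNIT = {"Standard", "Intensive"}
WARD_UNIT = {("W1", "Standard"), ("W2", "Standard"), ("W3", "Intensive")}


def id_violations(child_parent, position, category):
    """Tuples witnessing the body of an nc P(x, x'), not K(.) -> false,
    where `position` (0 or 1) selects the argument compared against K."""
    return {t for t in child_parent if t[position] not in category}


def key_violations(child_parent):
    """Triples (x, y, z) with P(x, y), P(x, z) and y != z (each pair once)."""
    return {
        (x, y, z)
        for (x, y) in child_parent
        for (x2, z) in child_parent
        if x == x2 and y < z
    }


if __name__ == "__main__":
    print(id_violations(WARD_UNIT, 0, WARD))  # empty: first ID holds
    print(id_violations(WARD_UNIT, 1, UNIT))  # empty: second ID holds
    print(key_violations(WARD_UNIT))          # empty: key constraint holds
```

An empty result means the corresponding constraint is satisfied; a non-empty result lists the tuples that make the body of the nc true, or the pairs of parents that the egd would have to equate.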
The OMD model allows us to build multidimensional ontologies. Each of them, in addition to an instance for a schema, includes a set of basic constraints as in 1.-3. above, a set of dimensional rules, and a set of dimensional constraints. All these rules and constraints are expressed in the Datalog± language associated to the schema. Below we introduce the general forms for the dimensional rules (those in 4.) and the dimensional constraints (in 5.), which are all application-dependent.
4. Dimensional rules as Datalog± tgds:
Here, and are categorical predicates, the are child-parent predicates, , , ; repeated variables in bodies (join variables) appear only in categorical positions in the categorical relations and attributes in child-parent predicates. (Footnote 11: This is a natural restriction to capture dimensional navigation as captured by the joins (cf. Example 3.2).)
Notice that existential variables appear only in non-categorical attributes. The main reason for this condition is that in some applications we may have an existing, fixed and closed-world multidimensional database providing the multidimensional structure and data. In particular, we may not want to create new category elements via value invention, but only values for non-categorical attributes, which do not belong to categories. We will discuss this condition in more detail and its possible relaxation in Section 6.4.
5. Dimensional constraints, as egds or ncs, of the forms:
Here, , , and .