The use of preferences in query answering, both in traditional databases and in ontology-based data access, has recently received much attention due to its many real-world applications. In particular, in recent times, there has been a huge change in the way data is created and consumed, and users have largely moved to the Social Web, a system of platforms used to socially interact by sharing data and collaborating on tasks.
In this paper, we tackle the problem of preference-based query answering in Datalog+/– ontologies assuming that the user must rely on subjective reports to get a complete picture and make a decision. This kind of situation arises all the time on the Web; for instance, when searching for a hotel, users provide some basic information and receive a list of answers to choose from, each associated with a set of subjective reports (often called reviews) written by other users to tell everyone about their experience. The main problem with this setup, however, is that users are often overwhelmed and frustrated, because they cannot decide which reviews to focus on and which ones to ignore: a very negative (or positive) review may, for instance, have been produced on the basis of a feature that is completely irrelevant to the querying user.
We study a formalization of this process and its incorporation into preference-based query answering in Datalog+/– ontologies, proposing the use of trust and relevance measures to select the best reports to focus on, given the user’s initial preferences, as well as novel ranking algorithms to obtain a user-tailored answer. The main contributions of this paper can be briefly summarized as follows.
We present an approach to preference-based top-k query answering in Datalog+/– ontologies, given a collection of subjective reports. Here, each report contains scores for a list of features, its author’s preferences among the features, as well as additional information. These pieces of information from every report are then aggregated, along with the querying user’s trust in each report, into a ranking of the query results relative to the preferences of the querying user.
We present a basic approach to ranking the query results, where each atom is associated with the average of the scores of all reports, and every report is ranked with the average of the scores of each feature, weighted by the report’s trust values and the relevance of the feature and of the report for the querying user.
We then present an alternative approach to ranking the query results, where we first select the most relevant reports for the querying user, adjust the scores by the trust measure, and compute a single score for each atom by combining the scores computed in the previous step, weighted by the relevance of the features.
We present algorithms for preference-based top-k (atomic) query answering in Datalog+/– ontologies under both rankings. We also prove that, under suitable assumptions, the two algorithms run in polynomial time in the data complexity.
Finally, we also propose and discuss a more general form of reports, which are associated with sets of atoms rather than single atoms.
The rest of this paper is organized as follows. In Section 2, we provide some preliminaries on Datalog+/– and the preference models used. Section 3 then defines subjective reports, along with their trust measures and their relevance. In Section 4, we introduce the two rankings of query results, along with top-k query answering algorithms under these rankings and data tractability results. Section 5 then presents more general subjective reports. In Section 6, we discuss related work. Finally, the concluding Section 7 summarizes the main results of this paper and gives an outlook on future research.
First, we briefly recall some basics on Datalog+/–, namely, on relational databases and (Boolean) conjunctive queries ((B)CQs) (along with tuple- and equality-generating dependencies (TGDs and EGDs, respectively) and negative constraints), the chase procedure, and ontologies in Datalog+/–. We also define the preference models used.
Databases and Queries. We assume (i) an infinite universe of (data) constants $\Delta$ (which constitute the “normal” domain of a database), (ii) an infinite set of (labeled) nulls $\Delta_N$ (used as “fresh” Skolem terms, which are placeholders for unknown values, and can thus be seen as variables), and (iii) an infinite set of variables $\mathcal{V}$ (used in queries, dependencies, and constraints). Different constants represent different values (unique name assumption), while different nulls may represent the same value. We assume a lexicographic order on $\Delta \cup \Delta_N$, with every symbol in $\Delta_N$ following all symbols in $\Delta$. We denote by $\mathbf{X}$ sequences of variables $X_1, \ldots, X_k$ with $k \geq 0$. We assume a relational schema $\mathcal{R}$, which is a finite set of predicate symbols (or simply predicates). A term $t$ is a constant, null, or variable. An atomic formula (or atom) $\mathbf{a}$ has the form $P(t_1, \ldots, t_n)$, where $P$ is an $n$-ary predicate, and $t_1, \ldots, t_n$ are terms. We say that $\mathbf{a}$ is ground iff every $t_i$ belongs to $\Delta$.
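A database (instance) $D$ for a relational schema $\mathcal{R}$ is a (possibly infinite) set of atoms with predicates from $\mathcal{R}$ and arguments from $\Delta$. A conjunctive query (CQ) over $\mathcal{R}$ has the form $Q(\mathbf{X}) = \exists \mathbf{Y}\, \Phi(\mathbf{X}, \mathbf{Y})$, where $\Phi(\mathbf{X}, \mathbf{Y})$ is a conjunction of atoms (possibly equalities, but not inequalities) with the variables $\mathbf{X}$ and $\mathbf{Y}$, and possibly constants, but no nulls. A CQ is atomic iff $\Phi(\mathbf{X}, \mathbf{Y})$ is a single atom and $\mathbf{Y} = \emptyset$ (i.e., there are no existentially quantified variables). A Boolean CQ (BCQ) over $\mathcal{R}$ is a CQ of the form $Q()$, i.e., all variables are existentially quantified, often written as the set of all its atoms without quantifiers, when there is no danger of confusion. Answers to CQs and BCQs are defined via homomorphisms, which are mappings $\mu \colon \Delta \cup \Delta_N \cup \mathcal{V} \to \Delta \cup \Delta_N \cup \mathcal{V}$ such that (i) $c \in \Delta$ implies $\mu(c) = c$, (ii) $c \in \Delta_N$ implies $\mu(c) \in \Delta \cup \Delta_N$, and (iii) $\mu$ is naturally extended to atoms, sets of atoms, and conjunctions of atoms. The set of all answers to a CQ $Q(\mathbf{X}) = \exists \mathbf{Y}\, \Phi(\mathbf{X}, \mathbf{Y})$ over $D$, denoted $Q(D)$, is the set of all tuples $\mathbf{t}$ over $\Delta$ for which there exists a homomorphism $\mu \colon \mathbf{X} \cup \mathbf{Y} \to \Delta \cup \Delta_N$ such that $\mu(\Phi(\mathbf{X}, \mathbf{Y})) \subseteq D$ and $\mu(\mathbf{X}) = \mathbf{t}$. The answer to a BCQ $Q()$ over a database $D$ is Yes, denoted $D \models Q$, iff $Q(D) \neq \emptyset$.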
Given a relational schema $\mathcal{R}$, a tuple-generating dependency (TGD) $\sigma$ is a first-order formula of the form $\forall \mathbf{X} \forall \mathbf{Y}\, \Phi(\mathbf{X}, \mathbf{Y}) \rightarrow \exists \mathbf{Z}\, \Psi(\mathbf{X}, \mathbf{Z})$, where $\Phi(\mathbf{X}, \mathbf{Y})$ and $\Psi(\mathbf{X}, \mathbf{Z})$ are conjunctions of atoms over $\mathcal{R}$ (without nulls), called the body and the head of $\sigma$, denoted $\mathit{body}(\sigma)$ and $\mathit{head}(\sigma)$, respectively. Such $\sigma$ is satisfied in a database $D$ for $\mathcal{R}$ iff, whenever there exists a homomorphism $h$ that maps the atoms of $\Phi(\mathbf{X}, \mathbf{Y})$ to atoms of $D$, there exists an extension $h'$ of $h$ that maps the atoms of $\Psi(\mathbf{X}, \mathbf{Z})$ to atoms of $D$. All sets of TGDs are finite here. Since TGDs can be reduced to TGDs with only single atoms in their heads, in the sequel, every TGD has w.l.o.g. a single atom in its head. A TGD $\sigma$ is guarded iff it contains an atom in its body that contains all universally quantified variables of $\sigma$. The leftmost such atom is the guard atom (or guard) of $\sigma$. A TGD $\sigma$ is linear iff it contains only a single atom in its body. A set of TGDs is guarded (resp., linear) iff all its TGDs are guarded (resp., linear).
Query answering under TGDs, i.e., the evaluation of CQs and BCQs on databases under a set of TGDs is defined as follows. For a database for , and a set of TGDs on , the set of models of and , denoted , is the set of all (possibly infinite) databases such that (i) and (ii) every is satisfied in . The set of answers for a CQ to and , denoted (or, for , ), is the set of all tuples such that for all . The answer for a BCQ to and is Yes, denoted , iff . Note that query answering under general TGDs is undecidable , even when the schema and TGDs are fixed . Decidability and tractability in the data complexity of query answering for the guarded case follows from a bounded tree-width property.
A negative constraint (or simply constraint) $\gamma$ is a first-order formula of the form $\forall \mathbf{X}\, \Phi(\mathbf{X}) \rightarrow \bot$, where $\Phi(\mathbf{X})$ (called the body of $\gamma$) is a conjunction of atoms over $\mathcal{R}$ (without nulls). Under the standard semantics of query answering of BCQs in Datalog+/– with TGDs, adding negative constraints is computationally easy, as for each constraint $\forall \mathbf{X}\, \Phi(\mathbf{X}) \rightarrow \bot$, we only have to check that the BCQ $\Phi(\mathbf{X})$ evaluates to false in $D$ under $\Sigma$; if one of these checks fails, then the answer to the original BCQ $Q$ is true, otherwise the constraints can simply be ignored when answering the BCQ $Q$.
An equality-generating dependency (EGD) $\sigma$ is a first-order formula of the form $\forall \mathbf{X}\, \Phi(\mathbf{X}) \rightarrow X_i = X_j$, where $\Phi(\mathbf{X})$, called the body of $\sigma$ and denoted $\mathit{body}(\sigma)$, is a conjunction of atoms over $\mathcal{R}$ (without nulls), and $X_i$ and $X_j$ are variables from $\mathbf{X}$. Such $\sigma$ is satisfied in a database $D$ for $\mathcal{R}$ iff, whenever there is a homomorphism $h$ such that $h(\Phi(\mathbf{X})) \subseteq D$, it holds that $h(X_i) = h(X_j)$. Adding EGDs over databases with TGDs along with negative constraints does not increase the complexity of BCQ query answering as long as they are non-conflicting. Intuitively, this ensures that, if the chase (see below) fails (due to strong violations of EGDs), then it already fails on the database, and if it does not fail, then whenever “new” atoms are created in the chase by the application of the EGD chase rule, atoms that are logically equivalent to the new ones are guaranteed to be generated also in the absence of the EGDs, guaranteeing that EGDs do not influence the chase with respect to query answering.
We usually omit the universal quantifiers in TGDs, negative constraints, and EGDs, and we implicitly assume that all sets of dependencies and/or constraints are finite.
The Chase. The chase was first introduced to enable checking implication of dependencies, and later also for checking query containment. By “chase”, we refer both to the chase procedure and to its output. The TGD chase works on a database via so-called TGD chase rules (see  for an extended chase with also EGD chase rules).
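TGD Chase Rule. Let $D$ be a database, and $\sigma$ a TGD of the form $\Phi(\mathbf{X}, \mathbf{Y}) \rightarrow \exists \mathbf{Z}\, \Psi(\mathbf{X}, \mathbf{Z})$. Then, $\sigma$ is applicable to $D$ iff there exists a homomorphism $h$ that maps the atoms of $\Phi(\mathbf{X}, \mathbf{Y})$ to atoms of $D$. Let $\sigma$ be applicable to $D$, and $h_1$ be a homomorphism that extends $h$ as follows: for each $X_i \in \mathbf{X}$, $h_1(X_i) = h(X_i)$; for each $Z_j \in \mathbf{Z}$, $h_1(Z_j) = z_j$, where $z_j$ is a “fresh” null, i.e., $z_j \in \Delta_N$, $z_j$ does not occur in $D$, and $z_j$ lexicographically follows all other nulls already introduced. The application of $\sigma$ on $D$ adds to $D$ the atom $h_1(\Psi(\mathbf{X}, \mathbf{Z}))$ if not already in $D$.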
The chase algorithm for a database $D$ and a set of TGDs $\Sigma$ consists of an exhaustive application of the TGD chase rule in a breadth-first (level-saturating) fashion, which outputs a (possibly infinite) chase for $D$ and $\Sigma$. Formally, the chase of level up to $0$ of $D$ relative to $\Sigma$, denoted $\mathit{chase}^0(D, \Sigma)$, is defined as $D$, assigning to every atom in $D$ the (derivation) level $0$. For every $k \geq 1$, the chase of level up to $k$ of $D$ relative to $\Sigma$, denoted $\mathit{chase}^k(D, \Sigma)$, is constructed as follows: let $I_1, \ldots, I_n$ be all possible images of bodies of TGDs in $\Sigma$ relative to some homomorphism such that (i) $I_1, \ldots, I_n \subseteq \mathit{chase}^{k-1}(D, \Sigma)$ and (ii) the highest level of an atom in every $I_i$ is $k - 1$; then, perform every corresponding TGD application on $\mathit{chase}^{k-1}(D, \Sigma)$, choosing the applied TGDs and homomorphisms in a (fixed) linear and lexicographic order, respectively, and assigning to every new atom the (derivation) level $k$. The chase of $D$ relative to $\Sigma$, denoted $\mathit{chase}(D, \Sigma)$, is defined as the limit of $\mathit{chase}^k(D, \Sigma)$ for $k \to \infty$.
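The (possibly infinite) chase relative to TGDs is a universal model, i.e., there exists a homomorphism from $\mathit{chase}(D, \Sigma)$ onto every $B \in \mathit{mods}(D, \Sigma)$. This implies that BCQs $Q$ over $D$ and $\Sigma$ can be evaluated on the chase for $D$ and $\Sigma$, i.e., $D \cup \Sigma \models Q$ is equivalent to $\mathit{chase}(D, \Sigma) \models Q$. For guarded TGDs $\Sigma$, such BCQs $Q$ can be evaluated on an initial fragment of $\mathit{chase}(D, \Sigma)$ of constant depth $k \cdot |Q|$, which is possible in polynomial time in the data complexity.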
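The level-saturating application of the TGD chase rule can be illustrated with a minimal Python sketch. Everything about the encoding is an assumption made for illustration: atoms are tuples `(predicate, arg1, ...)`, variables are strings starting with an uppercase letter, fresh nulls are strings of the form `_z0`, `_z1`, ..., and the function names `match`, `homomorphisms`, and `chase` are hypothetical. The sketch applies each (TGD, homomorphism) pair at most once and stops after a bounded number of levels, so it only computes an initial fragment of the chase.

```python
from itertools import count

def is_var(term):
    # Assumption: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

def match(atom, fact, h):
    """Try to extend the mapping h so that atom is mapped onto fact."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    h = dict(h)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if h.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return h

def homomorphisms(body, db, h=None):
    """Yield every mapping that sends all body atoms to atoms of db."""
    h = h or {}
    if not body:
        yield h
        return
    for fact in sorted(db):
        extended = match(body[0], fact, h)
        if extended is not None:
            yield from homomorphisms(body[1:], db, extended)

def chase(db, tgds, max_level):
    """Apply single-head TGDs (body, head) level by level, up to max_level."""
    db = set(db)
    fresh = count()
    applied = set()  # (TGD index, homomorphism) pairs already used
    for _ in range(max_level):
        new_atoms = set()
        for i, (body, head) in enumerate(tgds):
            for h in homomorphisms(body, db):
                trigger = (i, tuple(sorted(h.items())))
                if trigger in applied:
                    continue
                applied.add(trigger)
                h1 = dict(h)
                for term in head[1:]:
                    # Existential variables receive a fresh null.
                    if is_var(term) and term not in h1:
                        h1[term] = f"_z{next(fresh)}"
                new_atoms.add((head[0],) + tuple(h1.get(t, t) for t in head[1:]))
        if new_atoms <= db:
            break  # nothing new was derived at this level
        db |= new_atoms
    return db
```

For instance, with the hypothetical TGDs `hotel(X) → accom(X)` and `accom(X) → ∃Z locatedIn(X, Z)`, calling `chase({("hotel", "h1")}, tgds, 3)` derives `accom(h1)` at level 1 and a `locatedIn` atom with a fresh null at level 2.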
Datalog+/– Ontologies. A Datalog+/– ontology , where , consists of a database , a set of TGDs , a set of non-conflicting EGDs , and a set of negative constraints . We say that is guarded (resp., linear) iff is guarded (resp., linear). The following example illustrates a simple Datalog+/– ontology, which is used in the sequel as a running example.
Consider the following simple ontology , where:
This ontology models a very simple accommodation booking domain, which could be used as the underlying model in an online system. Accommodations can be either hotels, bed and breakfasts, hostels, apartments, or aparthotels. The database provides some instances for each kind of accommodation, as well as some location facts.
Preference Models. We now briefly recall some basic concepts regarding the representation of preferences. We assume the following sets, giving rise to the logical language used for this purpose: $\Delta_{\mathit{Pref}} \subseteq \Delta$ is a finite set of constants, $\mathcal{R}_{\mathit{Pref}} \subseteq \mathcal{R}$ is a finite set of predicates, and $\mathcal{V}_{\mathit{Pref}} \subseteq \mathcal{V}$ is an infinite set of variables. These sets give rise to a corresponding Herbrand base consisting of all possible ground atoms that can be formed, which we denote with $\mathcal{H}_{\mathit{Pref}}$, while $\mathcal{H}_{\mathit{Ont}}$ is the Herbrand base for the ontology. Clearly, we have $\mathcal{H}_{\mathit{Pref}} \subseteq \mathcal{H}_{\mathit{Ont}}$, meaning that preference relations are defined over a subset of the possible ground atoms.
A preference relation over set is any binary relation . Here, we are interested in strict partial orders (SPOs), which are irreflexive and transitive relations—we consider these to be the minimal requirements for a useful preference relation. One possible way of specifying such a relation is the preference formula framework of . We use to denote the set of all possible strict partial orders over a set .
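Finally, the rank of an element $a$ in a preference relation $\succ$ is defined inductively as follows: (i) $\mathit{rank}(a, \succ) = 1$ iff there is no $b$ such that $b \succ a$; and (ii) $\mathit{rank}(a, \succ) = k + 1$ iff $\mathit{rank}(a, \succ) = 1$ after eliminating from $\succ$ all elements of rank at most $k$.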
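The inductive definition of rank can be read as repeatedly peeling off the undominated elements. The following is a minimal Python sketch of that reading; the function name `ranks` and the encoding of a preference relation as a set of pairs `(a, b)`, meaning that `a` is strictly preferred to `b`, are assumptions made for illustration.

```python
def ranks(elements, prefers):
    """Return a dict mapping each element to its rank.

    prefers is a set of pairs (a, b) meaning "a is strictly preferred to b";
    it is assumed to be a strict partial order (irreflexive and transitive).
    """
    remaining = set(elements)
    rank = {}
    k = 1
    while remaining:
        # Elements that no remaining element is preferred to get rank k.
        top = {a for a in remaining
               if not any((b, a) in prefers for b in remaining)}
        if not top:
            raise ValueError("relation is not a strict partial order")
        for a in top:
            rank[a] = k
        remaining -= top
        k += 1
    return rank
```

Incomparable elements end up with the same rank, which matches the fact that the definition assigns rank 1 to every element that nothing is preferred to.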
3 Subjective Reports
Let be a Datalog+/– ontology, be a ground atom such that , and be a tuple of features associated with the predicate , each of which has a domain . We sometimes slightly abuse notation and use to also denote the set of features .
A report for a ground atom is a triple , where , is an SPO over the elements of , and is a set of pairs .
Intuitively, reports are evaluations of an entity of interest (atom ) provided by observers. In a report , specifies a “score” for each feature, indicates the relative importance of the features to the report’s observer, and (called information register) contains general information about the report itself and who provided it. Reports will be analyzed by a user, who has his own strict partial order, denoted , over the set of features. The following is a simple example involving hotel ratings.
Consider again the accommodation domain from Example 1, and let the features for predicate hotel be ; in the following, we abbreviate these features as loc, cl, pri, br, and net, respectively.
An example of a report for is , where is given by the graph in Fig. 1 (left side); (the user’s SPO) is shown in the same figure (right side). Finally, let be a register with fields age, nationality, and type of traveler, with data , , and .
The set of all reports available is denoted with Reports. In the following, we use to denote the set of all reports that are associated with a ground atom . Given a tuple of features , we use to denote the set of all SPOs over .
3.1 Trust Measures over Reports
A user analyzing a set of reports may decide that certain opinions within a given report may be more trustworthy than others. For instance, returning to our running example, the score given for the location feature of might be considered more trustworthy than the ones given for price or breakfast, e.g., because the report declared the former to be among the most preferred features, while the latter are among the least preferred ones, cf. Fig. 1 (left). Another example could be a user who is generally distrustful of reports on feature cleanliness, because he has learned that people are in general much more critical than he is when it comes to evaluating that aspect of a hotel, or of reports on feature price by business travelers, because they do not use their own money to pay. Formally, we have the following definition of trust measure.
A trust measure is any function .
Note that trust measures do not depend on the user’s own preferences over (in ); rather, for each report , they give a measure of trust to each of the scores in depending on and . The following shows an example of a trust measure.
3.2 Relevance of Reports
The other aspect of importance that a user must consider when analyzing reports is how relevant they are to his/her own preferences. For instance, a report given by someone who has preferences that are completely opposite to those of the user should be considered less relevant than one given by someone whose preferences only differ in a trivial aspect. This is inherently different from the trust measure described above, since trust is computed without taking into account the preference relation given by the user issuing the query. Formally, we define relevance measures as follows.
A relevance measure is any function .
Thus, a relevance measure takes as input a report and an SPO and gives a measure of how relevant the report is relative to ; this is determined on the basis of and , and can also take into account.
Consider again the running example, and suppose that the user assigns relevance to a report according to the function
From Fig. 1, e.g., we have that .
Alternatively, a relevance measure comparing the SPO of a report with the user’s SPO (thus, in this case, information in is ignored by the relevance measure) might be defined as follows. The relevance measure checks to what extent the two SPOs agree on the relative importance of the features in . Formally, let and be SPOs over . We define a measure of similarity of and as follows:
Here, is used to denote the symmetric difference (i.e., ). In the definition of ,
the first condition refers to the case where and are expressing the same order between and ,
the second condition refers to the case where both and are not expressing any order between and ,
the third condition refers to the case where one of and is expressing an order between and and the other is not expressing any order,
the last condition refers to the case where and are expressing opposite orders between and .
Clearly, is when and agree on everything, and when and agree on nothing. Finally, we define a relevance measure by for every report and SPO .
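One way to make the four cases concrete is the following Python sketch. It is not the formula used above (which is stated in terms of the symmetric difference), only a pairwise approximation under explicit assumptions: a pair of features contributes 1 when the two SPOs agree on it (same order, or no order in either), a hypothetical partial credit of 0.5 when only one SPO orders it, and 0 when the orders are opposite; the result is normalized by the number of feature pairs. The name `sim` and the `partial` parameter are illustrative.

```python
from itertools import combinations

def sim(features, p1, p2, partial=0.5):
    """Pairwise agreement between two SPOs given as sets of pairs (a, b),
    where (a, b) means "a is preferred to b". Returns a value in [0, 1]."""
    pairs = list(combinations(features, 2))
    if not pairs:
        return 1.0

    def orientation(p, a, b):
        # 1 if a is preferred to b, -1 if b is preferred to a, 0 if unordered.
        return 1 if (a, b) in p else -1 if (b, a) in p else 0

    total = 0.0
    for a, b in pairs:
        o1, o2 = orientation(p1, a, b), orientation(p2, a, b)
        if o1 == o2:
            total += 1.0       # same order, or neither SPO orders the pair
        elif o1 == 0 or o2 == 0:
            total += partial   # assumed credit: only one SPO orders the pair
        # opposite orders contribute nothing
    return total / len(pairs)
```

Under these assumptions the value is 1 when the two SPOs agree on every pair and 0 when they order every pair in opposite ways, in line with the boundary behavior described above.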
4 Query Answering based on Subjective Reports
To produce a ranking based on the basic components presented in Section 3, we must first develop a way to combine them in a principled manner. More specifically, the problem that we address is the following. The user is given a Datalog+/– ontology and has an atomic query of interest. The user also supplies an SPO over the set of features . The answers to an atomic query over in atom form are defined as ; we still use to denote the set of answers in atom form. Recall that in our setting, each ground atom such that is associated with a (possibly empty) set of reports. Since we consider atomic queries, each ground atom is an atom entailed by and is thus associated with a set of reports . Furthermore, each report is associated with a trust score . We want to rank the ground atoms in ; that is, we want to obtain a set where for ground atom takes into account:
the set of reports associated with ;
the trust score associated with each report ; and
the SPO over provided by the user issuing the query.
4.1 A Basic Approach
A first approach to solving this problem is Algorithm RepRank-Basic in Fig. 2. A score for each atom is computed as the average of the scores of the reports associated with the atom, where the score of a report is computed as follows:
we first compute the average of the scores weighted by the trust value for and a value measuring how important feature is for the user issuing the query (this value is given by );
then, we multiply the value computed in the previous step by , which gives a measure of how relevant is w.r.t. .
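The two steps above can be sketched in Python as follows. This is a reading of the textual description, not a transcription of the algorithm in Fig. 2; the report encoding (a dict with a `"scores"` entry), the callables `trust` and `relevance` (the latter with the user's SPO already fixed), the `feature_weight` table, and the choice to score atoms without reports as 0 are all assumptions.

```python
def rep_rank_basic(answers, reports_for, trust, relevance, feature_weight, k):
    """Rank answer atoms by the average score of their reports; return the top k.

    answers        : iterable of ground atoms (any hashable values)
    reports_for    : dict mapping an atom to its list of reports
    trust(r)       : dict mapping each feature of report r to a trust value
    relevance(r)   : relevance of report r for the querying user
    feature_weight : dict mapping a feature to its importance for the user
    """
    scored = []
    for atom in answers:
        reports = reports_for.get(atom, [])
        total = 0.0
        for r in reports:
            t = trust(r)
            scores = r["scores"]
            # Step 1: average of the scores, each weighted by trust and importance.
            avg = sum(s * t[f] * feature_weight[f]
                      for f, s in scores.items()) / len(scores)
            # Step 2: scale by the relevance of the report.
            total += relevance(r) * avg
        # Assumption: an atom without reports gets score 0.
        scored.append((atom, total / len(reports) if reports else 0.0))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A natural choice for `feature_weight`, assumed here rather than taken from the algorithm, is the inverse of each feature's rank in the user's SPO.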
The following is an example of how Algorithm RepRank-Basic works.
Consider again the setup from the running example, where we have the Datalog+/– ontology from Example 1, the set Reports of the reports depicted in Fig. 3, the SPO from Fig. 1 (right), the trust measure defined in Example 3, and the relevance measure introduced in Example 4. Finally, let .
Algorithm RepRank-Basic iterates through the set of answers (in atom form) to the query, which in this case consists of . For atom , the algorithm iterates through the set of corresponding reports, which is , and maintains the accumulated score after processing each report. For , the score is computed as (cf. line 6):
The score for after processing the three reports is approximately . Analogously, assuming , the score for is approximately . Therefore, the top-2 answer to is .
The following result states the time complexity of Algorithm RepRank-Basic. As long as both query answering and the computation of the trust and relevance measures can be done in polynomial time, RepRank-Basic can also be done in polynomial time.
The worst-case time complexity of Algorithm RepRank-Basic is , where , , (resp. ) is the worst-case time complexity of (resp. ), and is the data complexity of computing .
In the next section, we explore an alternative approach to applying the trust and relevance measures to top-k query answering.
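Fig. 3 (table fragment): rows for features cl, pri, and br in three blocks, each block ending with the entries of an information register.
| cl | 0 | 1 | 0.5 | Age = 34 |
| pri | 0.4 | 1 | 0.25 | Nationality = Italian |
| br | 0.1 | 1 | 0.25 | Type = Business |

| cl | 0.3 | 0.1 | 1 | Age = 45 |
| pri | 0.2 | 0.1 | 0.5 | Nationality = Italian |
| br | 0.5 | 0.4 | 0.5 | Type = Leisure |

| cl | 0.9 | 0.5 | 0.0313 | Age = 29 |
| pri | 0.8 | 0.9 | 0.25 | Nationality = Spanish |
| br | 0.8 | 0.9 | 0.25 | Type = Leisure |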
4.2 A Different Approach to using Trust and Relevance
A more complex approach consists of using the trust and relevance scores provided by the respective measures in a more fine-grained manner. One way of doing this is via the following steps (more details on each of them are given shortly):
Keep only those reports that are most relevant to the user issuing the query, that is, those reports that are relevant enough to according to a relevance measure ;
consider the most relevant reports obtained in the previous step and use the trust measure given by the user to produce scores adjusted by the trust measure; and
for each atom, compute a single score by combining the scores computed in the previous step with .
The first step can simply be carried out by checking, for each report , if is above a certain given threshold. One way of doing the second step is described in Algorithm SummarizeReports (Fig. 4), which takes a trust measure , a set of reports Reports (for a certain atom), and a function collFunc. The algorithm processes each report in the input sets by building a histogram of average (trust-adjusted) reported values for each of the features with ten possible “buckets” (of course, this can be easily generalized to any number of buckets); for each report, the algorithm applies the trust measure to update each feature’s histogram. Once all of the reports are processed, the last step is to collapse the histograms into a single value—this is done by applying the collFunc function, which could simply be defined as the computation of a weighted average for each feature. This single value is finally used to produce the output, which is a tuple of scores. The following example illustrates how SummarizeReports works.
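The histogram construction just described can be sketched in Python as follows. This is an interpretation of the prose, not the algorithm of Fig. 4: it assumes that a bucket is selected by the trust value of a feature (trust values in [0, 1] split into equal-width buckets), that each bucket stores the average of the scores that fall into it, and that empty buckets are represented by `None`. The report encoding and the names `summarize_reports` and `coll_func` are illustrative.

```python
def summarize_reports(reports, trust, coll_func, features, n_buckets=10):
    """Collapse the reports of one atom into a tuple with one score per feature.

    trust(r)           : dict mapping each feature of report r to a value in [0, 1]
    coll_func(buckets) : maps a list of n_buckets averages (None = empty) to a score
    """
    hist = {f: [[] for _ in range(n_buckets)] for f in features}
    for r in reports:
        t = trust(r)
        for f in features:
            if f in r["scores"]:
                # Assumed bucketing: equal-width intervals over the trust value.
                b = min(int(t[f] * n_buckets), n_buckets - 1)
                hist[f][b].append(r["scores"][f])
    result = []
    for f in features:
        averaged = [sum(v) / len(v) if v else None for v in hist[f]]
        result.append(coll_func(averaged))
    return tuple(result)
```

Passing `coll_func` as a parameter keeps the collapsing strategy separate from the histogram construction, so different strategies can be swapped in without touching the rest.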
Let us adopt again the setup from Example 5. Suppose we want to keep only those reports for which the relevance score is above (as per the first step of our more complex approach). Recall that the set of answers to is and there are six associated reports. Among them, we keep only reports , , , and . Algorithm SummarizeReports will have when called for . The histograms built during this call are as follows:
loc: value 0.95 in bucket ;
cl: value 1 in bucket and value 0.3 in bucket ;
pri: value 0.8 in bucket and value 0.2 in bucket ;
br: value 0.1 in bucket and value 0.5 in bucket ; and
net: value 0.6 in bucket and value 1 in bucket .
Assuming that function collFunc disregards the values in the bucket corresponding to the lowest trust value (if more than one bucket is non-empty), and takes the average of the rest, we have the following result tuple as the output of SummarizeReports: . Analogously, we have tuple for tuple after calling SummarizeReports with .
The following proposition states the time complexity of Algorithm SummarizeReports. As long as the trust measure and the collFunc function can be computed in polynomial time, Algorithm SummarizeReports is polynomial time too.
The worst-case time complexity of Algorithm SummarizeReports is , where (resp. ) is the worst-case time complexity of (resp. collFunc).
The following example explores a few different ways in which function collFunc used in Algorithm SummarizeReports might be defined.
One way of computing collFunc is shown in Example 6. There can be other reasonable ways of collapsing the histogram for a feature into a single value. E.g., collFunc might compute the average across all buckets ignoring the trust measure so that no distinction is made among buckets, i.e., . Alternatively, the trust measure might be taken into account by giving a weight to each bucket (e.g., the weights might be set in such a way that buckets corresponding to higher trust scores have a higher weight, that is, for ). In this case, the histogram might be collapsed as follows . We may also want to apply the above strategies but ignoring the first buckets (for which the trust score is lower). Function collFunc can also be extended so that the number of elements associated with a bucket is taken into account.
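The variants mentioned above can be sketched in Python as follows, for a histogram given as a list of per-bucket averages with `None` marking an empty bucket, ordered from lowest to highest trust. The function names, the default weights (bucket index plus one), the normalization of the weighted sum, and the decision to average over non-empty buckets only are assumptions made for illustration.

```python
def coll_plain(buckets):
    """Average over the non-empty buckets, ignoring trust."""
    vals = [v for v in buckets if v is not None]
    return sum(vals) / len(vals) if vals else 0.0

def coll_weighted(buckets, weights=None):
    """Weighted average; by default higher-trust buckets weigh more."""
    weights = weights or [i + 1 for i in range(len(buckets))]
    pairs = [(w, v) for w, v in zip(weights, buckets) if v is not None]
    denom = sum(w for w, _ in pairs)
    return sum(w * v for w, v in pairs) / denom if denom else 0.0

def coll_drop_lowest(buckets):
    """Discard the lowest-trust non-empty bucket when more than one bucket
    is non-empty, then average the rest."""
    vals = [v for v in buckets if v is not None]
    if len(vals) > 1:
        vals = vals[1:]
    return sum(vals) / len(vals) if vals else 0.0
```

Taking the number of elements per bucket into account would require the histogram to carry counts alongside the averages, which this sketch omits.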
Thus, the second step discussed above gives scores (adjusted by the trust measure) for each ground atom. Recall that the third (and last) step of the approach adopted in this section is to compute a score for each atom by combining the scores computed in the previous step with . One simple way of doing this is to compute the weighted average of such scores where the weight of the -th score is the inverse of the rank of feature in .
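This last step can be sketched as a small Python helper. It assumes that the weighted average is normalized by the sum of the weights; the function name `combine` and its argument layout (scores and ranks given in the same feature order) are illustrative.

```python
def combine(scores, feature_ranks):
    """Weighted average of per-feature scores, where the weight of the
    i-th score is the inverse of the rank of the i-th feature."""
    weights = [1.0 / r for r in feature_ranks]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

With this normalization, features of equal rank contribute equally, and the result stays within the range of the input scores.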
Algorithm RepRank-Hist (Figure 5) is the complete algorithm that combines the three steps discussed thus far. The following continues the running example to show the result of applying this algorithm.
Let us adopt once again the setup from Example 5, but this time applying Algorithm RepRank-Hist. Suppose collFunc is the one discussed in Example 6 and thus Algorithm SummarizeReports returns the scores for and the scores for . Algorithm RepRank-Hist computes a score for each atom by performing a weighted average of the scores in these tuples, which results in:
Therefore, the top-2 answer to query is .
Note that the results from Examples 5 and 8 differ in the way they order the two tuples; this is due to the way in which relevance and trust scores are used in each algorithm—the more fine-grained approach adopted by Algorithm RepRank-Hist allows it to selectively use both kinds of values to generate a more informed result.
The worst-case time complexity of Algorithm RepRank-Hist is: , where , , is the worst-case time complexity of , is the worst-case time complexity of Algorithm SummarizeReports as per Proposition 2, and is the data complexity of computing .
If the input ontology belongs to the guarded fragment of Datalog+/–, then Algorithms RepRank-Basic and RepRank-Hist run in polynomial time in the data complexity.
Thus far, we have considered atomic queries. Since each ground atom such that is associated with a set of reports and every ground atom in is such that , reports can be associated with query answers in a natural way. We now introduce a class of queries more general than the class of atomic queries for which the same property holds. A simple query is a conjunctive query where contains exactly one atom of the form , called the distinguished atom (i.e., an atom whose variables are the query’s free variables). For instance, is a simple query where is the distinguished atom. The answers to a simple query over in atom form are defined as where the distinguished atom is of the form ; we still use to denote the set of answers in atom form. Clearly, for each atom in , it is the case that .
5 Towards more General Reports
In the previous section, we considered the setting where reports are associated with ground atoms such that . This setup is limited, since it does not allow one to express the fact that certain reports may apply to whole sets of atoms; this is necessary to model certain kinds of opinions often found in reviews, such as “accommodations in Oxford are expensive”. We now generalize the framework presented in Sections 3 and 4 to accommodate this kind of report.
A generalized report (g-report, for short) is a pair , where is a report and is a simple query, called the descriptor of .
We denote with g-Reports the universe of g-reports. Intuitively, given an ontology , a g-report is used to associate report with every atom in —recall that and thus general reports allow us to assign a report to a set of atoms entailed by .
Clearly, a report for a ground atom as defined in Definition 1 is a special case of a g-report in which the only answer to the descriptor is .
Consider our running example from the accommodations domain and suppose we want to associate a certain report with all accommodations in the city of Oxford. This can be expressed with a g-report where with descriptor .
Intuitively, a g-report is a report associated with a set of atoms, i.e., the set of atoms in . A simple way of handling this generalization would be to associate report with every atom in this set. Note that, as in the non-generalized case, it might be the case that two or more g-reports assign two distinct reports to the same ground atom. E.g., we may have a g-report , where , expressing that applies to all accommodations in Oxford, and another g-report , where , expressing that applies to all accommodations that are hotels. In our running example, we would simply associate both and to , , and .
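The simple handling just described, where a report is attached to every answer of its descriptor, can be sketched in Python as follows. The encoding of a g-report as a `(report, descriptor)` pair and the callable `answers_of`, which stands in for evaluating a descriptor over the ontology, are assumptions made for illustration.

```python
from collections import defaultdict

def associate(g_reports, answers_of):
    """Map each ground atom to the reports of all g-reports whose
    descriptor has that atom among its answers.

    g_reports           : iterable of (report, descriptor) pairs
    answers_of(descr)   : iterable of ground atoms answering the descriptor
    """
    reports_for = defaultdict(list)
    for report, descriptor in g_reports:
        for atom in answers_of(descriptor):
            reports_for[atom].append(report)
    return dict(reports_for)
```

An atom that answers several descriptors simply collects several reports, which is the situation described above for accommodations that are both hotels and located in Oxford.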
In the approach just described, the reports coming from different g-reports are treated in the same way: they all have the same impact on the common atoms. Another possibility is to determine when a g-report is in some sense more specific than another and take such a relationship into account (e.g., more specific g-reports should have a greater impact when computing the ranking over atoms). We consider this kind of scenario in the following section.
Leveraging the Structural Properties of Ontologies
We now study two kinds of structure that can be leveraged from knowledge contained in the ontology. The first is based on the notion of hierarchies, which are useful in capturing the influence of reports in “is-a” type relationships. As an example, given a query requesting a ranking over hotels in Oxfordshire, a report for all hotels in Oxford should have a higher impact on the calculation of the ranking than a report for all accommodations in the UK—in particular, the latter might be ignored altogether since it is too general. The second kind of structure is based on identifying subset relationships between the atoms associated with the descriptors in g-reports. For instance, a report for all hotels in Oxford is more general than a report for all hotels in Oxford city center, since the former is a superset of the latter.
In the following, we define a partial order among reports based on these notions. We begin by defining hierarchical TGDs.
A set of linear TGDs is said to be hierarchical iff for every we have that and there does not exist a database over and a TGD in of the form such that and share ground instances relative to .
In the rest of this section, we assume that all ontologies contain a (possibly empty) subset of hierarchical TGDs. Furthermore, given ontology where is a set of hierarchical TGDs, and two ground atoms , we say that is-a iff . For instance, in Example 1, set is a hierarchical set of TGDs (assuming that the conditions over the features hold).
Given tuples of features and such that
and vectors and over the domains of and , respectively, we say that is a particularization of , denoted iff if and otherwise.
Let be a Datalog+/– ontology, be a ground atom such that , and be a g-report with