Constraints are extremely useful in Wikidata, as they can be in any knowledge base. In Wikidata, property constraints express regularities (patterns of data) which should hold in general, but may have exceptions [PropertyConstraintsPortal]. In practice, they are used to identify potential problems (constraint violations) to interested contributors who can then either fix the problem or determine that the particular anomaly is acceptable.
One simple example is the symmetric constraint (for readability we use the English label to identify a Wikidata item, here https://www.wikidata.org/wiki/Q21510862; in formulas, we replace spaces with underscores), which is understood to indicate that whenever a fact $p(s, o)$ exists for a symmetric property $p$ (such as spouse), the fact $p(o, s)$ should normally also be present. (We use predicate(subject, object) notation rather than (subject, predicate, object).) As of mid-June 2020 there were over thirty-eight hundred non-symmetric spousal relationships in Wikidata. We know this because of a report generated by a constraint-checking tool. Greater contributor effort, or perhaps additional tools, are needed to determine how many of these non-symmetries are due to missing spouse statements (as opposed to legitimate exceptions), and then create them, but that is a separate challenge. The point here is simply that this constraint-checking tool has produced a valuable report.
Wikidata constraints, however, are represented and processed in an incomplete, ad hoc fashion. Although in most cases they are declared and documented reasonably clearly, the declarations do not fully express their meaning. For example, it is possible to declare that spouse is subject to the symmetric constraint. However, crucially, there is no formal characterization of what it means for a property to be symmetric. That is only stated in natural language documentation.
Stepping outside of Wikidata, it is straightforward to formally express this meaning in first-order logic (FOL) (with $s$ and $o$ as free variables, as explained in Section 3):
$$\text{spouse}(s, o) \rightarrow \text{spouse}(o, s) \qquad (1)$$
The value of formal characterizations is foundational in Computer Science. We rely on them for clarity in specification in most of our activities. And yet Wikidata lacks the logical framework to take advantage of characterizations like Formula (1). Such a framework, if available, would provide a precise basis for constraint specification, and a logical foundation for constraint-checking implementations.
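To make the link between such a characterization and a checker concrete, the following is a minimal Python sketch of evaluating the symmetry pattern over a toy set of facts. The tuple encoding, the item names, and the function `symmetric_violations` are assumptions made for this illustration; they are not part of Wikidata or of any existing constraint-checking tool.

```python
from typing import Iterable, List, Set, Tuple

# (predicate, subject, object), mirroring predicate(subject, object) notation
Fact = Tuple[str, str, str]


def symmetric_violations(facts: Iterable[Fact], prop: str) -> List[Fact]:
    """Return every fact prop(s, o) for which prop(o, s) is absent."""
    present: Set[Fact] = set(facts)
    return sorted(
        (p, s, o)
        for (p, s, o) in present
        if p == prop and (p, o, s) not in present
    )


# Hypothetical data: the second marriage is recorded in one direction only.
facts = [
    ("spouse", "Alice", "Bob"),
    ("spouse", "Bob", "Alice"),
    ("spouse", "Carol", "Dave"),
]

print(symmetric_violations(facts, "spouse"))
```

Each returned fact is a candidate for either a missing inverse statement or a legitimate exception; the sketch does not try to tell the two apart.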
Further, in current practice specifying a new constraint, and building a constraint checker for it, may be unnecessarily laborious, idiosyncratic, and error-prone. A logical formulation and implementation of constraints would permit constraints to be quickly specified and reduce the implementation burden for each new type of constraint.
In prior work [PatelSchneiderWikidataOnMars], building on the work of Marx et al. [marx2017logic], we proposed a logical framework for Wikidata. It supports the specification of rules that can be used to draw inferences and so achieve a much more complete collection of facts, which in turn can support a more comprehensive, effective, and easy-to-use query service over Wikidata. This is done in a way that accounts for, leverages, and facilitates the use of the representational conventions in Wikidata.
Our logical framework also encompasses the handling of constraints. In this paper, we describe how this is done, and show that nearly all of Wikidata’s existing property constraints can be given a complete characterization in a natural and economical fashion, using a familiar style of logical expression. These logical formulae, unlike documentation in natural language, provide an unambiguous basis for understanding constraints and for implementing constraint checkers. (Indeed, once an evaluation capability exists for these formulae, checking a new constraint requires no new engineering effort.) We also give characterizations for several proposed property constraints that could usefully be added to Wikidata. In addition, we show that our approach allows for representing and handling a broader range of constraints, going beyond property constraints, in the same formalism.
In the next two sections, we give a general overview of the current handling of constraints in Wikidata, and an overview of our logical framework for Wikidata. In three sections after that, we give examples of our approach’s characterization of existing property constraints, proposed property constraints, and several useful non-property constraints. We follow that with discussion, related work, and conclusion sections.
2 Property Constraints in Wikidata
In current Wikidata practice, “constraint” is used for both “property constraints” and “complex constraints”. We give here an overview of Wikidata property constraints. Complex constraints (also known as “custom constraints”; see https://www.wikidata.org/wiki/Template:Complex_constraint) are not considered in this paper, although our approach can handle constraints beyond property constraints.
At present there are 30 property constraint types used in Wikidata, as revealed by the “up-to-date list” SPARQL query link included on [PropertyConstraintsPortal]. As explained on that page, “constraints for a property are specified as statements on the property, using property constraint (P2302) and the constraint type item”. For example, in the notation we’ve adopted for this paper, the following statement says that spouse (P26) is constrained by the symmetric constraint (Q21510862) constraint type:
$$\text{property\_constraint}(\text{spouse}, \text{symmetric\_constraint})$$
Many constraints are configurable by specifying values for parameters, which are stated as qualifiers on the constraint statement. (Statement and qualifier in Wikidata are defined in the Wikibase Data Model [WikibaseDataModel].) There are several general parameters that can be added to any constraint statement, such as constraint status (which can have values mandatory constraint or suggestion constraint) and exception to constraint (which is used to list known exceptions). There are other parameters that are specific to a particular constraint type, or a small group of constraint types. We shall see examples of some of these in subsequent sections.
3 Logical Framework
Our logical framework for Wikidata [PatelSchneiderWikidataOnMars] supports the use of both rules and constraints. Rules are used to draw inferences; constraints are used to detect the presence of questionable data patterns. After briefly reviewing the prior work of Marx et al. [marx2017logic] – which produced MARS, MAPL, and MARPL – we then introduce our extensions to these – eMARS, eMAPL, and eMARPL – which are the logical foundations of our approach. In our approach, rules are expressed in eMARPL, and constraints in eMAPL.
3.1 MARS, MAPL, and MARPL
As noted in [marx2017logic], Wikidata’s custom data model goes beyond the Property Graph data model, which associates sets of attribute-value pairs with the nodes and edges of a directed graph, by allowing for attributes with multiple values. Marx et al. refer to such generalized Property Graphs as multi-attributed graphs, and observe that “In spite of the huge practical significance of these data models …, there is practically no support for using such data in knowledge representation”. Given that motivation, Marx et al. introduce the multi-attributed relational structure (MARS) to provide a formal data model for generalized Property Graphs, and multi-attributed predicate logic (MAPL) for modeling knowledge over such structures. MARS and MAPL may be viewed as extensions of FOL to support the use of attributes (with multiple values). In terms of the underlying logical formalism (which is out of scope here), MARS provides the structures that serve as interpretations for MAPL.
The essential new elements over FOL are these:
a set term is either a set variable or a set of attribute-value pairs $\{a_1 : b_1, \ldots, a_n : b_n\}$, where $a_i, b_i$ can be object terms. Object terms are the usual basic terms of FOL, and can be either constants or object variables.
a relational atom is an expression $p(t_1, \ldots, t_n)@S$, where $p$ is an n-ary predicate, $t_1, \ldots, t_n$ are object terms and $S$ is a set term.
a set atom is an expression $(a : b) \in S$, where $a, b$ are object terms and $S$ is a set term.
These elements are best illustrated with a simple example (taken directly from [marx2017logic]):
$$\forall x, y, z_1, z_2, z_3.\ \text{spouse}(x, y)@\{\text{start} : z_1, \text{loc} : z_2, \text{end} : z_3\} \rightarrow \text{spouse}(y, x)@\{\text{start} : z_1, \text{loc} : z_2, \text{end} : z_3\} \qquad (2)$$
This MAPL formula states that spouse is a symmetric relation, where the inverse statement has the same start date, end date, and location. Each occurrence of $\text{spouse}(\cdot, \cdot)@\{\ldots\}$ is a relational atom, which includes the set term $\{\text{start} : z_1, \text{loc} : z_2, \text{end} : z_3\}$. If that set term were represented by a set variable $U$, then one could make an assertion about its membership using the set atom $(\text{start} : z_1) \in U$.
In Wikidata terms, this particular relational atom (once $x$ and $y$ have been instantiated to specific Wikidata items) corresponds to a statement, and each attribute-value pair (once the variable $z_i$ has been instantiated to a specific value) corresponds to a qualifier of the statement. ($x$, of course, is called the subject of the statement, and $y$ the value or object of the statement.) While MAPL allows for predicates of arbitrary arity, in Wikidata all statements are triples; i.e., Wikidata properties have arity 2.
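The correspondence between relational atoms, set atoms, and Wikidata statements with qualifiers can be sketched as a small Python data model. The class `Statement`, the helper `inverse`, and the sample items are assumptions made for this illustration. The helper is also a simplification: it copies the listed attributes onto the inverse statement, rather than matching only attribute sets of exactly the shape written in the example formula.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Pair = Tuple[str, str]  # one attribute-value pair (a : b)


@dataclass(frozen=True)
class Statement:
    """A binary relational atom p(s, o)@S with a ground attribute set S."""
    prop: str
    subject: str
    value: str
    qualifiers: FrozenSet[Pair] = frozenset()

    def has(self, attribute: str, val: str) -> bool:
        """Evaluate the set atom (attribute : val) ∈ S."""
        return (attribute, val) in self.qualifiers


def inverse(stmt: Statement, copied: Tuple[str, ...]) -> Statement:
    """Build p(o, s), carrying over all values of the listed attributes."""
    kept = frozenset((a, v) for (a, v) in stmt.qualifiers if a in copied)
    return Statement(stmt.prop, stmt.value, stmt.subject, kept)


# Hypothetical data; note that one attribute ("loc") carries two values.
married = Statement(
    "spouse", "Alice", "Bob",
    frozenset({("start", "1990"), ("loc", "Paris"),
               ("loc", "Lyon"), ("source", "registry")}),
)
mirrored = inverse(married, ("start", "loc", "end"))
print(mirrored)
```

Representing the qualifiers as a set of pairs, rather than as a dictionary, is what allows an attribute to have several values, which is the feature that distinguishes multi-attributed graphs from plain Property Graphs.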
Marx et al. go on to introduce multi-attributed rule-based predicate logic (MARPL), a MAPL fragment which is decidable for fact entailment, but still provides a high level of expressivity. In addition, they define a variant of MARPL, and show that deciding fact entailment is in polynomial time with respect to data complexity (i.e., when considering rules, but not data, to be fixed). Due to these characteristics of MARPL and its variant, along with their logically well-founded handling of attributes, we believe MARPL to be the best logical foundation for expressing inference rules in Wikidata, and supporting reasoning using such rules. Note that Formula 2 falls within the MARPL fragment. MARPL also allows for a special type of function that can be used to construct an attribute set in the head of a rule. A MARPL ontology, then, includes a set of rules and a set of these function definitions. Because the representation and checking of constraints in our framework builds on MAPL rather than MARPL, we omit any further details about MARPL.
3.2 eMARS, eMAPL, and eMARPL
MARS, MAPL, and MARPL are close to providing a logical basis for Wikidata, but are still missing two essential elements:
Wikidata-specific datatypes. Datatypes play a large role in Wikidata, and it has its own set of datatypes with certain idiosyncrasies, as documented in [WikibaseDataModel]. In order to specify the manipulation of data elements in rules, functions and relations are needed for constructing, accessing, and combining the data elements of each of Wikidata’s datatypes.
A feasible means of specifying the uses of attributes in rules. Handling Wikidata qualifiers (which are represented as attributes in MAPL and MARPL) correctly requires accounting for potentially many attributes in each of many rules, which is infeasible, from a practical perspective, with MARPL.
In [PatelSchneiderWikidataOnMars], we provide a semi-detailed sketch for addressing each of these needs. (A more formal specification will be provided in a future publication.) Specifically, we define an extended MARS (eMARS) as a MARS extended with a specification of datatypes, with their associated relations and functions, and we discuss the functions and relations that are needed for each of Wikidata’s datatypes. We define extended MAPL (eMAPL) to include eMAPL terms, which are MAPL terms augmented with datatype function applications, and eMAPL formulae, which allow for the use of eMAPL terms and datatype relations as predicates. To further support the representation of constraints, we also add equality and, as syntactic sugar, counting quantifiers.
To address the second need mentioned above, we introduce attribute characterizations, which provide a means to describe the desired behavior of attributes when rules fire, separately from the rules themselves, and we define an extended MARPL (eMARPL) ontology to include, in addition to rules and function definitions, a set of attribute characterizations. We also describe how these characterizations can be used as macros, modifying the functions and rules of an eMARPL ontology.
Given these logical constructs, we show in [PatelSchneiderWikidataOnMars] how Wikidata itself can be represented as an eMARS, and discuss some of the essential rules that are needed for inferencing in Wikidata (including, but not limited to, ontological rules that axiomatize foundational Wikidata concepts such as instance of, subclass of, subproperty of, reflexive property, and transitive property). Other types of rules are possible and important, such as the rules instantiated in the SQID tool [marx2017sqid]. The “meaning of Wikidata” is then the inferential closure of the eMARS under an eMARPL ontology composed of rules, function definitions, and attribute characterizations. It is this eMARS that is used when querying or otherwise requesting what is true in Wikidata, or checking constraints.
3.3 Representing Constraints in eMAPL
We model Wikidata constraints as eMAPL formulae that are evaluated over the eMARS that is the “meaning of Wikidata”. Because constraint formulae are used as queries, and not for inferencing, we can take advantage of the greater expressiveness of eMAPL. It is known that the data complexity of evaluating FOL formulae is polynomial, and that remains true for eMAPL formulae.
Constraints can either be given a positive formulation, which expresses a pattern of data elements that conform to the constraint, or a negative formulation, which expresses a pattern of data elements that violate the constraint. In our view, it is most natural to first write the positive formulation, and from that derive the negative formulation, which can then be used as a query. (The derivation of the negative formulation starts with applying the negation operator to the positive formulation, and then applies transformations, if desired, based upon well-known laws of logic.)
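As a worked instance of this derivation, consider a positive formulation with the common implication shape, whose free variables $\vec{x}$ are read as universally quantified when checking conformance. The symbols $A$, $B$, and $C$ are placeholders introduced here for illustration only.

```latex
\begin{align*}
\neg\,\forall \vec{x}\,\bigl(A(\vec{x}) \wedge B(\vec{x}) \rightarrow C(\vec{x})\bigr)
  &\equiv \exists \vec{x}\,\neg\bigl(\neg(A(\vec{x}) \wedge B(\vec{x})) \vee C(\vec{x})\bigr) \\
  &\equiv \exists \vec{x}\,\bigl(A(\vec{x}) \wedge B(\vec{x}) \wedge \neg C(\vec{x})\bigr)
\end{align*}
```

Dropping the existential quantifier leaves the free-variable query $A(\vec{x}) \wedge B(\vec{x}) \wedge \neg C(\vec{x})$, each answer to which is one violation.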
For example, the distinct_values_constraint (https://www.wikidata.org/wiki/Help:Property_constraints_portal/Unique_value) in Wikidata indicates that a given property should have different values for different items (across all of Wikidata). The following eMAPL formula embodies this constraint. Here, because we are treating these formulae as queries, the variables are considered to be free variables. We omit attribute sets wherever they are irrelevant to the meaning of the constraint. In other words, for each atom missing an attribute set there is an implicit variable, which can be ignored by a constraint-checker (formula evaluator), or treated as an additional free variable.
Formula 3 (the positive formulation) directly expresses the meaning of the constraint in the usual fashion of first-order logic. If satisfied (for all possible bindings of the free variables), the constraint has no violations.
In all of the formulae for existing property constraints, we employ Wikidata’s property constraint declarations, an approach that works nicely. For example, in Formula 3, the first conjunct will match against one of Wikidata’s existing property constraint declarations, thereby binding its property variable to one of the properties having the distinct values constraint (e.g., the ISBN-13 property, P212).
Formula 4 below (the negative formulation), where satisfied, identifies items that violate the constraint.
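The negative formulation of the distinct values constraint can be read as a query for two different items that share a value of a constrained property. The Python sketch below evaluates that reading over toy facts. The tuple encoding, the sample ISBN strings, and the function name are assumptions made for this illustration. Unlike a formula with free variables, which would report each offending pair in both orders, the sketch reports each pair once.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Fact = Tuple[str, str, str]  # (predicate, subject, object)


def distinct_value_violations(facts: Iterable[Fact]) -> List[Tuple[str, str, str, str]]:
    """Return (property, value, item1, item2) for distinct items sharing a value."""
    facts = list(facts)
    # Properties declared as property_constraint(p, distinct_values_constraint).
    constrained = {
        s for (p, s, o) in facts
        if p == "property_constraint" and o == "distinct_values_constraint"
    }
    # Group the items that hold each (property, value) combination.
    holders: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for (p, s, o) in facts:
        if p in constrained:
            holders[(p, o)].add(s)
    violations = []
    for (p, o), items in sorted(holders.items()):
        ordered = sorted(items)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                violations.append((p, o, first, second))
    return violations


# Hypothetical data: two books share an ISBN-13; titles are unconstrained.
isbn_facts = [
    ("property_constraint", "ISBN-13", "distinct_values_constraint"),
    ("ISBN-13", "BookA", "978-0-00-000000-2"),
    ("ISBN-13", "BookB", "978-0-00-000000-2"),
    ("ISBN-13", "BookC", "978-0-00-000001-9"),
    ("title", "BookA", "Same"),
    ("title", "BookB", "Same"),
]

print(distinct_value_violations(isbn_facts))
```

As in the formulae, the set of constrained properties is read from the declarations in the data itself, so adding a declaration for another property requires no change to the checker.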
Because, in our framework, constraints are checked after the KB has been augmented by running the rules (i.e., the constraints are checked over the “meaning of Wikidata” KB), a far more useful set of results will be obtained. Inferences from rules will instantiate facts that were missing from the original KB, thus providing a complete (with respect to the rules) set of facts to be checked. Consequently, a complete and accurate set of constraint violations will be found, and false positives and negatives (which would have resulted from missing facts) will be avoided.
In our framework, as illustrated above, the specification of a new property constraint type involves, in addition to the creation of property constraint type declarations of the sort used in current practice, an eMAPL formula for the new type (or several formulae, if preferred, in some cases). These formulae, unlike documentation in natural language, provide an unambiguous basis for understanding and implementing constraint checkers. Once an evaluation capability exists for eMAPL formulae, checking a new constraint will require no new engineering effort.
We investigated the extent to which Wikidata’s existing property constraints can be expressed in eMAPL. Out of 26 property constraints examined, only one could not readily be expressed in eMAPL. We also became aware of one proposed property constraint that cannot readily be expressed. In both cases, the problem can be addressed in a straightforward manner.
In the next 2 sections, we show examples of existing and proposed property constraints, expressed in eMAPL, which illustrate more of its features. eMAPL allows for representing and handling a broad range of constraints, going beyond property constraints, in the same formalism. In Section 6, we illustrate this with several examples of non-property constraints. In Section 7.2, we discuss the 2 property constraints that could not readily be expressed. In Section 8, we mention some advantages that eMAPL offers over the use of SPARQL.
4 Existing Property Constraints
In the appendix, we give complete characterizations for 26 of the 30 property constraint types in current use. As explained there, we omitted 4 constraint types – the same 4 omitted in [AbianConstraintsReport] – due to insufficient documentation being available for them. Here, we present two of the 26 characterizations, to illustrate other features of eMAPL.
The mandatory qualifier constraint (Q21510856) (https://www.wikidata.org/wiki/Help:Property_constraints_portal/Mandatory_qualifiers) provides a nice illustration of attribute set variables and set atoms (from Section 3.1) in the characterization of a constraint type. Here, a set atom is used to obtain the value of the property qualifier on the constraint declaration; that value identifies another qualifier whose use is mandatory with the given property. For example, this constraint type is used with the property population (P1082). If this formula were to be evaluated, when the property variable binds with that property, the qualifier variable will bind with the qualifier point in time (P585), which is the “mandatory” qualifier. The relational atom for the constrained property will bind with a fact with property population, and its set variable with that fact’s statement qualifiers. The right-hand side of the formula, then, checks that this set of statement qualifiers contains the mandatory qualifier.
This is the positive formulation for this constraint type. If one wanted to identify all of the (very many) instantiations that conform to this constraint type, one could use this positive formulation. But as noted above, in practice one would derive and use the negative formulation to identify violations:
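As an informal illustration of the negative formulation, the following Python sketch lists statements of a constrained property that lack the mandatory qualifier. The `Statement` tuple, the `stmt` helper, the identifier spellings, and the sample data are assumptions made for this sketch.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Statement(NamedTuple):
    prop: str
    subject: str
    value: str
    qualifiers: FrozenSet[Tuple[str, str]]


def stmt(prop: str, subject: str, value: str, *pairs: Tuple[str, str]) -> Statement:
    """Convenience constructor: stmt(p, s, o, (attribute, value), ...)."""
    return Statement(prop, subject, value, frozenset(pairs))


def missing_mandatory_qualifier(kb: List[Statement]) -> List[Tuple[Statement, str]]:
    """Return (statement, qualifier) pairs where a mandatory qualifier is absent."""
    # Read (property, mandatory qualifier) pairs from the constraint declarations.
    required = [
        (d.subject, q)
        for d in kb
        if d.prop == "property_constraint" and d.value == "mandatory_qualifier_constraint"
        for (attr, q) in sorted(d.qualifiers)
        if attr == "property"
    ]
    return [
        (s, q)
        for (p, q) in required
        for s in kb
        if s.prop == p and not any(attr == q for (attr, _) in s.qualifiers)
    ]


# Hypothetical data: the second population figure has no point in time.
kb = [
    stmt("property_constraint", "population", "mandatory_qualifier_constraint",
         ("property", "point_in_time")),
    stmt("population", "Springfield", "30000", ("point_in_time", "2010")),
    stmt("population", "Shelbyville", "25000"),
]

for statement, qualifier in missing_mandatory_qualifier(kb):
    print(statement.subject, "lacks", qualifier)
```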
The value type constraint (Q21510865) (https://www.wikidata.org/wiki/Help:Property_constraints_portal/Value_class), which states that each value of the given property should have a given type (which is also known as the range of the property), is an example where it is convenient to express the constraint type with multiple formulae. In this case, we use three formulae, one for each possible value of the relation qualifier (although it could be done with a single formula if desired). The relation qualifier characterizes the allowed relationship between the value and the type (which is given by the class qualifier). Note also that these formulae allow for any number of values for the class qualifier, in keeping with current practice.
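The role of the relation qualifier can be sketched in Python as a three-way case split over one conformance test. The function names, the dictionary encoding of instance of and subclass of, and the sample classes are assumptions made for this illustration. Treating every class as a subclass of itself is also an assumption of the sketch.

```python
from typing import Dict, Iterable, Set


def superclasses(cls: str, subclass_of: Dict[str, Set[str]]) -> Set[str]:
    """Reflexive-transitive closure of subclass_of, starting from cls."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def conforms(value: str, relation: str, classes: Iterable[str],
             instance_of: Dict[str, Set[str]],
             subclass_of: Dict[str, Set[str]]) -> bool:
    """Check one value against a value type constraint with several allowed classes."""
    allowed = set(classes)
    is_instance = any(
        superclasses(c, subclass_of) & allowed for c in instance_of.get(value, ())
    )
    is_subclass = bool(superclasses(value, subclass_of) & allowed)
    if relation == "instance_of":
        return is_instance
    if relation == "subclass_of":
        return is_subclass
    if relation == "instance_or_subclass_of":
        return is_instance or is_subclass
    raise ValueError(f"unknown relation: {relation}")


# Hypothetical class hierarchy.
subclass_of = {"city": {"human_settlement"}, "capital": {"city"}}
instance_of = {"Paris": {"capital"}}

print(conforms("Paris", "instance_of", ["human_settlement"], instance_of, subclass_of))
print(conforms("Paris", "subclass_of", ["human_settlement"], instance_of, subclass_of))
```

If the knowledge base has already been closed under rules for instance of and subclass of, as our framework envisions, the `superclasses` walk is unnecessary and each case reduces to a direct lookup.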
5 Proposed Property Constraints
Here, we show three other property constraint types that we believe should be included in Wikidata. There are many other useful property constraint types that could be characterized using eMAPL, including many of the suggested types (determined by survey of active Wikipedia editors) listed in [AbianConstraintsReport].
Asymmetric property constraint. Although there is a class asymmetric Wikidata property (https://www.wikidata.org/wiki/Q18647519), there is no property constraint for asymmetry. (This differs from the case of the class symmetric property (https://www.wikidata.org/wiki/Q18647518), which does have a corresponding property constraint.) In any case, the concept of asymmetric property cannot be expressed in eMARPL (and thus, unlike the case of symmetric property, cannot be expressed as a rule of inference). However, asymmetry can easily be expressed as a constraint in eMAPL, as follows.
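Read as a query for violations, asymmetry asks for pairs of items related by the property in both directions. The Python sketch below evaluates that query over toy facts. The constraint type name `asymmetric_constraint` is hypothetical, since no such constraint type exists in Wikidata, and the tuple encoding and sample data are likewise assumptions made for this illustration.

```python
from typing import Iterable, List, Tuple

Fact = Tuple[str, str, str]  # (predicate, subject, object)


def asymmetric_violations(facts: Iterable[Fact]) -> List[Tuple[str, str, str]]:
    """Return (property, item1, item2), once per pair related in both directions."""
    present = set(facts)
    constrained = {
        s for (p, s, o) in present
        if p == "property_constraint" and o == "asymmetric_constraint"
    }
    return sorted({
        (p, *sorted((s, o)))
        for (p, s, o) in present
        if p in constrained and (p, o, s) in present
    })


# Hypothetical data: two items are each recorded as the other's mother.
family = [
    ("property_constraint", "mother", "asymmetric_constraint"),
    ("mother", "Ann", "Beth"),
    ("mother", "Beth", "Ann"),
    ("mother", "Cara", "Dina"),
    ("spouse", "Ed", "Flo"),
    ("spouse", "Flo", "Ed"),
]

print(asymmetric_violations(family))
```

A statement relating an item to itself is also reported, because asymmetry implies irreflexivity.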
Local value type constraint. The concept of a “local” value type constraint has proven to be valuable in ontology engineering (where it is sometimes called a “local range restriction”), and can easily be expressed by extending the characterization of value type constraint (see Section 4). “Local” in this context indicates that the constraint holds when the subject of a statement has a particular type, such as the children of humans being humans. This constraint can be characterized as follows: If the subject item of a statement has the given type (indicated using qualifier local_class), the referenced (object) item should be a subclass or instance of the given type (indicated using qualifier class). This constraint calls for a distinct property constraint statement for each local class that one desires to distinguish for a given property (but it’s already accepted practice to have multiple property constraint statements for a given property and constraint type). Because we are modeling the declaration of this constraint as an extension of the value type constraint, we retain the three possible values for the relation qualifier. (We actually have reservations about the usefulness of the subclass_of and instance_or_subclass_of values, not only here but also for type constraint and value type constraint. However, a discussion of their usefulness is out of scope for this paper.)
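The conditional character of this constraint can be sketched in Python as follows. The sketch covers only the instance_of value of the relation qualifier and assumes that instance_of facts have already been closed under subclass reasoning. The function name, the tuple encoding, and the sample data are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Fact = Tuple[str, str, str]  # (predicate, subject, object)


def local_value_type_violations(facts: Iterable[Fact], prop: str,
                                local_class: str, value_class: str) -> List[Tuple[str, str]]:
    """Return (subject, value) pairs where the subject has local_class
    but the value is not an instance of value_class."""
    facts = list(facts)
    types: Dict[str, Set[str]] = defaultdict(set)
    for (p, s, o) in facts:
        if p == "instance_of":
            types[s].add(o)
    return sorted(
        (s, o)
        for (p, s, o) in facts
        if p == prop and local_class in types[s] and value_class not in types[o]
    )


# Hypothetical data: the constraint applies to Ann (a human) but not to Rex (a dog).
household = [
    ("instance_of", "Ann", "human"),
    ("instance_of", "Bob", "human"),
    ("instance_of", "Robo", "robot"),
    ("instance_of", "Rex", "dog"),
    ("instance_of", "Pup", "dog"),
    ("child", "Ann", "Bob"),
    ("child", "Ann", "Robo"),
    ("child", "Rex", "Pup"),
]

print(local_value_type_violations(household, "child", "human", "human"))
```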
Essential property constraint. The importance of a particular property for items of a particular type could be indicated in a similar fashion to local value type constraint. For example, it would be useful to indicate that a person should normally have a parent property statement. Because there are persons whose parents are unknown, a constraint would be more appropriate for this sort of example than a rule, in our framework. This constraint would provide stronger guidance regarding the importance of a particular property than the existing meta-property properties for this type (https://www.wikidata.org/wiki/Property:P1963), which merely indicates the properties that are normally used with items of a particular type. Note that the meaning of this constraint is different from that of allowed entity type constraint, and item requires statement constraint.
This property is also “local” in the sense that it is conditioned on the subject of a statement being of a particular type. In the world of ontology engineering, this constraint is sometimes called a “local existential restriction”.
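A violation of such a constraint is an item of the given type with no statement at all for the essential property. The Python sketch below computes these items as a set difference. The function name, the tuple encoding, and the sample data are assumptions made for this illustration.

```python
from typing import Iterable, List, Tuple

Fact = Tuple[str, str, str]  # (predicate, subject, object)


def essential_property_violations(facts: Iterable[Fact], prop: str,
                                  local_class: str) -> List[str]:
    """Return instances of local_class that have no statement for prop."""
    facts = list(facts)
    members = {s for (p, s, o) in facts if p == "instance_of" and o == local_class}
    having = {s for (p, s, o) in facts if p == prop}
    return sorted(members - having)


# Hypothetical data: Bob is a human with no recorded parent.
people = [
    ("instance_of", "Ann", "human"),
    ("instance_of", "Bob", "human"),
    ("instance_of", "Rex", "dog"),
    ("parent", "Ann", "Cy"),
]

print(essential_property_violations(people, "parent", "human"))
```

Because some persons have unknown parents, the returned items are prompts for review rather than errors.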
6 Non-Property Constraints
It is natural to consider a broader range of constraints, and desirable to express them all in the same logical framework. Here, we show eMAPL formulae for several useful constraints that fall outside the definition of “property constraint”. As noted below, some of these are already present in Wikidata (in some other form besides a constraint). For those that are already present, we leverage the existing Wikidata declarations (as we have done for property constraints). To the best of our knowledge, in current Wikidata practice these examples would normally be checked by creating a bot, which would require a greater effort than simply evaluating one of these formulae (as could be done in our proposed framework), and the effort would likely be relatively ad hoc, cumbersome, and error-prone.
6.1 Union of Classes and Disjoint Classes
The existing union of (https://www.wikidata.org/wiki/Property:P2737) and disjoint union of (https://www.wikidata.org/wiki/Property:P2738) (meta-)properties can each be expressed with a pair of formulae. Here, we use the “dummy value” list_values_as_qualifiers (https://www.wikidata.org/wiki/Q23766486) with of (https://www.wikidata.org/wiki/Property:P642), in accord with existing practice for these properties.
disjoint with (https://www.wikidata.org/wiki/Wikidata:Property_proposal/disjoint_with), a proposed property, was discussed in 2016 but not adopted. In our opinion, it would be a valuable addition to Wikidata.
6.2 No-value Constraint
We think the best treatment of a no-value snak (https://www.mediawiki.org/wiki/Wikibase/DataModel#PropertyNoValueSnak) is as a constraint, but it is unclear whether a no-value snak means no value at all, no value with the same qualifiers (as the no-value snak), or something in between. These options can be modelled as eMAPL constraint formulae. Note that the some-value snak doesn’t call for a constraint, but is addressed by other means in [PatelSchneiderWikidataOnMars].
Formula 9 captures the “no value at all” interpretation. Note that no_value statements do not exist per se in Wikidata, but could be generated from Wikidata’s internal representation of PropertyNoValueSnak.
Formula 10 captures the “no value with same qualifiers” interpretation.
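The difference between the two interpretations can be sketched in Python with a single flag. The tuple encodings for statements and for generated no_value records, the function name, and the sample data are assumptions made for this illustration.

```python
from typing import FrozenSet, List, Tuple

Qualifiers = FrozenSet[Tuple[str, str]]
Statement = Tuple[str, str, str, Qualifiers]  # p(s, o)@Q encoded as (p, s, o, Q)
NoValue = Tuple[str, str, Qualifiers]         # generated record encoded as (s, p, Q)


def no_value_violations(no_values: List[NoValue], statements: List[Statement],
                        same_qualifiers: bool) -> List[Tuple[NoValue, Statement]]:
    """Pair each no-value record with every statement that contradicts it.

    same_qualifiers=False: any value for the property is a violation.
    same_qualifiers=True: only a value with an identical qualifier set is.
    """
    return [
        (nv, st)
        for nv in no_values
        for st in statements
        if st[1] == nv[0] and st[0] == nv[1]
        and (not same_qualifiers or st[3] == nv[2])
    ]


# Hypothetical data: "no child" is asserted as of 1990; a child is recorded as of 2000.
nv = ("Ann", "child", frozenset({("point_in_time", "1990")}))
st = ("child", "Ann", "Bob", frozenset({("point_in_time", "2000")}))

print(len(no_value_violations([nv], [st], same_qualifiers=False)))
print(len(no_value_violations([nv], [st], same_qualifiers=True)))
```

The same data therefore yields a violation under the first interpretation and none under the second, which is why the intended meaning of a no-value snak needs to be settled before such a check is deployed.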
6.3 Other Examples
Formula 11 expresses the existing metasubclass of relation (https://www.wikidata.org/wiki/Property:P2445) between two metaclasses: instances of the first (subject) metaclass are likely to be subclasses of classes that are instances of the second (object) metaclass.
7 Discussion
7.1 Rules Versus Constraints
In a setting such as our proposed framework, there are some logical characterizations that can be sensibly used as either rules or constraints. For example, the concept of symmetric property, treated as a property constraint in Wikidata and thus included as a constraint in this paper, could be used as a rule in our framework, if one considers that it has no exceptions. We tend towards this view ourselves, and in fact, offer a rule for symmetric property (Wikidata includes a class symmetric property, but it is deprecated) in [PatelSchneiderWikidataOnMars], as well as rules that characterize the meaning of reflexive property, transitive property, instance of, subclass of, and subproperty of. In our framework, if a logical characterization is considered to be without exception, and can be expressed in eMARPL, there is no need to express it as a constraint. This is because the reasoning provided by firing the rule will ensure that there are no exceptions to be found by a constraint formula.
Some constraints (any whose eMAPL formula is also an eMARPL rule) could be used as rules, as-is. Given a framework that allows for both rules and constraints, such as our proposed framework, it isn’t necessarily obvious in every case whether a logical characterization should be treated as a rule or a constraint. It can depend not only on logical expressiveness, but also on intuitions and practices that have developed in the community. For example, the authors’ intuition and experience indicate that the concept of symmetry is inherent in symmetric properties by definition (as can easily be seen in the case of spouse or sibling), and thus one needn’t and shouldn’t allow for exceptions. Space constraints preclude a full discussion of this question of whether a rule or constraint usage is more suitable for a given logical characterization.
In current Wikidata practice, there is evidence of considerable ambivalence about the extent to which property constraints should allow for exceptions. The Help page for property constraints [PropertyConstraintsPortal] states that “Constraints are hints, not firm restrictions, and are meant as a help or guidance to the editor. They can have exceptions…”. At the same time, any constraint can be marked with a constraint status of mandatory, and 29.2% of constraints are characterized in this way, whereas only 4.6% of constraints have specified allowed exceptions (using the exception_to_constraint qualifier) [AbianConstraintsReport]. Moreover, the “Wikidata:2020 report on Property constraints” [AbianConstraintsReport] lists as a goal (i.e., an “ideal state”) for 21 existing property constraint types that they should have no exceptions (e.g.,“Goal: No value type constraint on Wikidata has exceptions.”).
We believe this ambivalence exists, in part, because Wikidata doesn’t currently provide an effective representation of rules (or a mechanism for deriving inferences from them), and thus the existing constraints framework has been forced to accommodate some things that ought to be rules (symmetric property, etc.). This provides another strong argument for the adoption of a framework such as ours.
In our framework, because of their use in reasoning, the expressiveness of rules necessarily must be more limited than that of constraint formulae. Thus, there are a few useful logical characterizations (e.g., union of, disjoint union of, disjoint classes) that one might wish to express as rules, but would not be able to. In such cases, it would be perfectly reasonable to check them as constraints. If desired, one could arrange by various means to ensure that violations of these constraints are not allowed to occur, thus achieving the effect of a rule, albeit in a somewhat more cumbersome fashion.
We identified one existing constraint (the Commons link constraint) and also became aware of one proposed property constraint (acyclicity) that cannot be readily expressed in eMAPL. The Commons link constraint (https://www.wikidata.org/wiki/Help:Property_constraints_portal/Commons_link) requires knowledge that is not contained in Wikidata. However, by adding Wikimedia Commons metadata to Wikidata (one fact per WC page, giving its name and namespace), this constraint can be easily expressed. The appendix contains additional details. The proposed acyclic property constraint, mentioned in [AbianConstraintsReport], would check whether a property’s usage has caused a cycle (e.g., A is B’s mother, B is C’s mother, C is A’s mother), which is outside the expressiveness of an FOL-based logic. However, because eMAPL is used only as a query language, it could be extended with property path constructs, like those of SPARQL (https://www.w3.org/TR/sparql11-query/#propertypaths), which would allow for the expression of this proposed constraint.
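What a property path construct would contribute is reachability over one or more steps of a property, so that an item reaching itself signals a cycle. The Python sketch below computes this with a depth-first search. The function name, the pair encoding, and the sample data are assumptions made for this illustration; the sketch is not a proposal for the syntax of such an eMAPL extension.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def items_on_cycles(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return every subject that can reach itself in one or more steps."""
    succ: Dict[str, Set[str]] = defaultdict(set)
    for s, o in pairs:
        succ[s].add(o)

    def reaches_itself(start: str) -> bool:
        seen: Set[str] = set()
        stack = list(succ.get(start, ()))
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(succ.get(node, ()))
        return False

    return sorted(n for n in list(succ) if reaches_itself(n))


# Hypothetical mother(subject, object) facts as (subject, object) pairs:
# A is B's mother, B is C's mother, C is A's mother; A is also D's mother.
mother = [("B", "A"), ("C", "B"), ("A", "C"), ("D", "A")]

print(items_on_cycles(mother))
```

D is not reported: it leads into the cycle but does not lie on it.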
We have not yet encountered any desirable non-property constraints that could not be expressed; however, we have not yet performed a thorough search for candidate non-property constraints.
8 Related Work
While there isn’t space to survey the large literature of logical frameworks for knowledge bases, we can highlight relevant work from several slices of that literature.
SPARQL. SPARQL is used extensively with Wikidata, via the Wikidata RDF dump, and in some constraint checking it is used in much the same way as we envision for eMAPL. Indeed, translation to SPARQL would be one implementation option for handling constraints expressed in eMAPL. SPARQL, of course, supports filters and many other expressiveness features. However, as also noted in Section 7, so far we’ve only identified one proposed constraint (acyclicity) that goes beyond the expressiveness of eMAPL, and eMAPL could be extended in a well-understood manner to allow for this. SPARQL also has the advantage of being supported by many existing products.
However, eMAPL provides an attribute set notation for qualifiers, which is far more natural and readable than using SPARQL over the complex representation of qualifiers in the RDF dump. Similarly, eMAPL provides Wikidata-specific datatype functions and relations, which, again, results in simpler, more natural, more compact expressions in some cases. (Wikidata-specific datatype functions and relations are needed, for example, in the contemporary, difference within range, format, integer, and range constraints, as shown in the appendix.) eMAPL allows for deployment options that are more integral with the native deployment of Wikidata, thus removing dependency on the RDF dump, and potentially allowing for more continuous, up-to-date constraint checking. At the same time, eMAPL provides a logical foundation for a broader array of deployment options that are external to Wikidata’s native deployment.
Constraints in KBs. Wikidata’s (and our) perspective on constraints is consistent with the view taken by other recent work on constraints for knowledge-graph-like systems. The SHACL Shapes Constraint Language [shacl], a W3C Recommendation since July 2017, and the Shape Expressions Language 2.1 (ShEx) [shex] are each used to specify valid data patterns in RDF KBs, and provide a framework for identifying violations of those patterns. The primary differences from our approach are that they are RDF-specific, and are grounded in pattern matching techniques rather than in evaluation of logical formulas. In addition, our approach provides support for Wikidata-specific data types and Wikidata’s use of qualifiers, and benefits from its role in a larger logical framework that supports rule-based inference. [DBLP:journals/corr/Patel-Schneider14] shows how Description Logic axioms (when interpreted in a closed-world setting) can be used for constraint checking, discusses their applicability to RDF KBs, and shows the feasibility of translation to SPARQL as an implementation strategy. The approach herein builds on FOL rather than Description Logic, and again, addresses challenges specific to Wikidata.
Logical foundations for Wikidata. SQID [marx2017sqid] is a browser and editor for Wikidata, which draws inferences from a collection of MARPL rules (SQID’s rule set may be viewed at https://tools.wmflabs.org/sqid/#/rules/browse). Our work was informed by SQID’s embodiment of MARPL-based reasoning, and motivated in part by the desire to expand the expressiveness of MARPL rules, as illustrated by the SQID rule set, to provide a more complete reasoning framework, and to accommodate Wikidata constraints. [hanika2019discovering] also formalizes a model of Wikidata based on MARS, but with a different objective: the application of “Formal Concept Analysis to efficiently identify comprehensible implications that are implicitly present in the data”. [hanika2019discovering] is thus nicely complementary with [marx2017logic] and with our work, in that it provides a basis for discovering, rather than hand-authoring, new (e)MARPL rules.
Logical foundations for annotated KBs. Annotated RDFS [zimmermann2012general] extends RDFS and RDFS semantics to support annotations of triples. It provides a deductive system, along with extensions to the SPARQL query language that enable querying of annotated graphs. While this approach could provide a useful target formalism for Wikidata’s RDF dumps, we have chosen instead to represent Wikidata’s data model as directly as possible, and thus we deliberately avoid the use of the RDF dumps, and the complexities that could arise from adopting RDF as the modeling framework.
Adding attributes to logics. Just as MARPL was developed to provide a (rule-based, Datalog-like) decidable fragment of MAPL, Krötzsch, Ozaki, et al. have also explored description logics as a basis for other decidable fragments of MAPL, and have analyzed the resulting family of attributed DLs in [krotzsch2017attributed, krotzsch2017reasoning, krotzsch2018attributed, ozaki2018happy, ozaki2019temporally]. We believe that MARPL provides the best available starting point for modeling Wikidata, but we also agree that this ongoing thread of research will lead to attributed DLs with the right level of expressivity for other sorts of applications.
9 Conclusion and Future Work
After reviewing our prior work that proposes a logical framework for Wikidata, based in part on extended multi-attributed predicate logic (eMAPL), we showed how the framework can be used to give logical characterizations (eMAPL formulas) for constraints in Wikidata, in a manner that makes use of Wikidata’s existing constraint declarations, but goes beyond them to give a complete expression of their meaning. We explained, at a high level, how constraint checking would take place in our framework. We are only aware of two property constraints (one existing, one proposed) which cannot currently be expressed in eMAPL; we explained how these could be addressed with extensions (to Wikidata content in one case, eMAPL in the other). Characterizations are also given for several proposed property constraints, and for several non-property constraints whose use could benefit Wikidata.
In future work, we plan to develop a detailed design for a scalable deployment of our proposed logical framework, in a manner that could integrate well with existing Wikidata infrastructure, workflow, and practices. We also plan to give eMAPL characterizations of the suggested property constraint types (determined by survey of active Wikipedia editors) in [AbianConstraintsReport], and analyze Wikidata’s existing complex constraints and the degree to which they could be accommodated in our framework.
Appendix: Existing Property Constraints
Here we present constraint formulae for 26 of the 30 property constraints in current use (as revealed by the “up-to-date list” SPARQL query link included on [PropertyConstraintsPortal]), in their positive formulations. These are the same 26 property constraints covered by the “Wikidata:2020 report on Property constraints” [AbianConstraintsReport]. We have omitted coverage for 4 property constraints which are inadequately documented (no Help pages that we could find); we expect to investigate them in future work, and do not expect to have any difficulty in characterizing them.
Only one of these 26 constraint characterizations, the Commons link constraint, requires an extension to eMAPL, as explained below. We do not account for uses of the constraint scope (https://www.wikidata.org/wiki/Property:P4680) parameter; this will be addressed in future work.
The brief constraint descriptions given here, in italics, have been adapted from the Wikidata Help page for property constraints [PropertyConstraintsPortal], or from the individual property constraint Help pages linked from there.
Variable abbreviates Constraint Qualifiers; Statement Qualifiers; subject; predicate; object; item or instance; type class.
Commons link constraint (Q21510852): Values for the property should be valid names of existing pages on Wikimedia Commons within a certain specified namespace.
This property constraint needs access to information from outside of Wikidata. To express the constraint in eMAPL requires converting information in Wikimedia Commons to eMAPL formulae. A simple conversion is to create an eMAPL (atomic) formula for each Wikimedia Commons page that provides the namespace information for the page name, using the predicate Commons_namespace. Then the constraint is simply a check that the appropriate formula is true. There is a constraint formula to check that the page exists and one to check that it is in the correct namespace.
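The two checks described above (page existence and correct namespace) can be sketched in Python as follows. This is an illustration only: the dictionary commons_namespace stands in for the imported Commons_namespace facts, its contents are made-up sample data, and the simplification that a page name maps to a single namespace is our assumption.

```python
# Hypothetical imported facts, one per Wikimedia Commons page:
# Commons_namespace(page_name, namespace)
commons_namespace = {
    "Mona Lisa.jpg": "File",
    "Paintings by Leonardo da Vinci": "Category",
}


def commons_link_violations(statements, required_namespace):
    """statements: iterable of (subject, value) pairs for one property.

    Returns a list of (subject, value, reason) triples.
    """
    violations = []
    for subject, value in statements:
        if value not in commons_namespace:
            # First check: a fact for this page name must exist.
            violations.append((subject, value, "page does not exist"))
        elif commons_namespace[value] != required_namespace:
            # Second check: the page must be in the declared namespace.
            violations.append((subject, value, "wrong namespace"))
    return violations


sample = [
    ("Q1", "Mona Lisa.jpg"),
    ("Q2", "Missing.jpg"),
    ("Q3", "Paintings by Leonardo da Vinci"),
]
report = commons_link_violations(sample, "File")
```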
allowed entity types constraint (Q52004125): The property may only be used on certain entity type(s).
allowed qualifiers constraint (Q21510851): Only the given qualifiers may be used with the property. Note: it’s unnecessary to explicitly mention the special case for “no value”, which is present in the documentation for this constraint type.
allowed units constraint (Q21514353): Values for this statement should only use certain units (or none).
Note: it’s unnecessary to explicitly mention the special case for “no value”, which is present in the documentation for this constraint type. In our framework, units are simply datatypes, so the logic allows one to state that is in the datatype .
citation needed constraint (Q54554025): Statements for the property should have at least one reference.
conflicts-with constraint (Q21502838): Items using this property should not have a certain other statement.
contemporary constraint (Q25796498): Two entities linked through a property with this constraint must be contemporary, that is, must coexist at some point in history.
Here, variables abbreviate start time, and end time. date_of_birth, inception, start_time, point_in_time, dissolved,_abolished_or_demolished_date, date_of_death, and end_time are existing Wikidata qualifiers. less_than and overlaps are datatype relations included in eMAPL [PatelSchneiderWikidataOnMars]. Note also that less_than applies to the main value of a time interval.
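The intent of the overlap test can be illustrated with a small Python sketch. It is not the eMAPL formula: times are reduced to plain integer years, a missing bound is represented by None and treated as open, and the function is only a stand-in for the overlaps datatype relation named above.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True unless one interval is known to end before the other starts.

    Bounds are integer years; None means the bound is unknown or open,
    in which case no violation can be established from that bound.
    """
    def ends_before(end, start):
        return end is not None and start is not None and end < start

    return not ends_before(end_a, start_b) and not ends_before(end_b, start_a)


# Two lifespans that share some years, and two that do not.
contemporary = overlaps(1452, 1519, 1475, 1564)
not_contemporary = overlaps(1452, 1519, 1606, 1669)
```

Treating unknown bounds as open reflects the view that constraints flag potential problems; a stricter policy would report entities with missing dates as well.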
difference within range constraint (Q21510854): The difference between the values for two properties should be within a certain range or interval. This constraint is available for quantity or date properties.
distinct values constraint (Q21502410): Values for this property should be unique across all of Wikidata, and no other entity should have the same value in a statement for this property.
format constraint (Q21502404): Values for this property should conform to a certain regular expression pattern.
Here we assume that matches_regex is a function associated with the StringValue datatype (a datatype mentioned in [PatelSchneiderWikidataOnMars] in connection with eMARS).
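A possible reading of matches_regex, sketched in Python, is shown below. We assume here that the pattern must match the entire value (hence re.fullmatch rather than re.search); the sample pattern and values are hypothetical.

```python
import re


def matches_regex(value, pattern):
    """True if the whole string value conforms to the regular expression."""
    return re.fullmatch(pattern, value) is not None


def format_violations(statements, pattern):
    """statements: iterable of (subject, value) pairs for one property."""
    return [(s, v) for s, v in statements if not matches_regex(v, pattern)]


# Hypothetical format parameter: four digits, a hyphen, three digits, digit or X.
pattern = r"[0-9]{4}-[0-9]{3}[0-9X]"
sample = [("Q1", "1234-567X"), ("Q2", "1234-5678 "), ("Q3", "12345678")]
bad = format_violations(sample, pattern)
```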
integer constraint (Q52848401): Values of this property should have integer type, i.e. a quantity without decimal places. This constraint type should only be used on properties with quantity datatype.
Here we assume that integer is a function associated with the QuantityValue datatype (a datatype mentioned in [PatelSchneiderWikidataOnMars] in connection with eMARS).
inverse constraint (Q21510855): The property has an inverse property, and values for the property should have a statement with the inverse property pointing back to the original item.
item requires statement constraint (Q21503247): Items using this property should have a certain other statement.
mandatory qualifier constraint (Q21510856): The given qualifier is mandatory for this property.
multi-value constraint (Q21510857): Items should have more than one statement with this property (or none).
One could easily increase the functionality of this constraint by adding a minimum_count parameter, and leveraging eMAPL’s counting quantifier to check for a specified minimum number of statements. Here, checks that there are at least instantiations of from evaluating the following expression.
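The effect of such a minimum_count parameter can be sketched in Python. The sketch counts statements per subject instead of using a counting quantifier, and the parameter name minimum_count is carried over from the suggestion above as a hypothetical extension, not an existing Wikidata parameter.

```python
from collections import Counter


def multi_value_violations(statements, minimum_count=2):
    """statements: iterable of (subject, value) pairs for one property.

    Subjects with no statement at all never appear in the input, so the
    "or none" case is satisfied automatically.
    """
    counts = Counter(subject for subject, _ in statements)
    return sorted(s for s, n in counts.items() if n < minimum_count)


sample = [("Q1", "a"), ("Q1", "b"), ("Q2", "a")]
too_few = multi_value_violations(sample)
```

With the default of two, this reproduces the existing multi-value constraint; a larger minimum_count tightens it.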
no bounds constraint (Q51723761): The value of the property should not be used with upper and lower bounds. This constraint type should only be used on properties with quantity datatype.
Here, precise is a relation associated with the QuantityValue datatype.
none of constraint (Q52558054): The specified values are not allowed for the property.
one-of constraint (Q21510859): Only the specified values are allowed for the property.
property scope constraint (Q53869507): The property should only be used in one of the specified ways: for the main value of a statement, in a reference or as qualifier.
range constraint (Q21510860): Values for this property should be within a certain range or interval. This constraint is available for quantity or date properties.
The following 2 formulae, using minimum_value and maximum_value qualifiers, are for use with quantity properties. Two additional formulae, instead using minimum_date and maximum_date qualifiers, are needed for date properties. (This conforms with the current declarations and documentation for this constraint.)
single best value constraint (Q52060874): The property should have a single “best” value for an item. It may have any number of values, but exactly one of them (the “best” one, by whatever criteria) should have preferred rank.
The first formula states that there should be at least one statement with preferred rank; the second formula states that if there are preferred rank statements with 2 different values, they must be distinguished by different “separator values” for the given separator qualifier.
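The two conditions just described can be illustrated with the following Python sketch. The tuple layout (subject, value, rank, separator value) and the use of None for a missing separator qualifier are assumptions made for the example, not part of the eMAPL characterization.

```python
from collections import defaultdict


def single_best_value_violations(statements):
    """statements: iterable of (subject, value, rank, separator_value)."""
    by_subject = defaultdict(list)
    for subject, value, rank, separator in statements:
        by_subject[subject].append((value, rank, separator))

    violations = []
    for subject, rows in by_subject.items():
        preferred = [(v, sep) for v, rank, sep in rows if rank == "preferred"]
        if not preferred:
            # First condition: at least one preferred-rank statement.
            violations.append((subject, "no preferred value"))
            continue
        # Second condition: distinct preferred values need distinct separators.
        values_by_separator = defaultdict(set)
        for v, sep in preferred:
            values_by_separator[sep].add(v)
        if any(len(vals) > 1 for vals in values_by_separator.values()):
            violations.append((subject, "ambiguous preferred values"))
    return violations


sample = [
    ("Q1", 100, "preferred", None), ("Q1", 90, "normal", None),
    ("Q2", 5, "normal", None),
    ("Q3", 1, "preferred", "2019"), ("Q3", 2, "preferred", "2020"),
    ("Q4", 1, "preferred", None), ("Q4", 2, "preferred", None),
]
report = single_best_value_violations(sample)
```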
single value constraint (Q19474404): The property generally only has a single value for an item.
symmetric constraint (Q21510862): Statements using this property should exist in both directions.
To also check that the 2 symmetric facts have the same attribute sets, simply add the use of the attribute set variable SQ:
Note, however, that it may not be desirable to check that the 2 statement qualifier attribute sets are identical, as indicated by the example in the Motivation section of WikiProject Reasoning (https://www.wikidata.org/wiki/Wikidata:WikiProject_Reasoning). Given that, one could introduce a new constraint parameter to specify exactly which statement qualifiers should be the same (or, conversely, which statement qualifiers are not required to be the same), and craft a constraint expression to check that.
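As a rough illustration of this idea, the Python sketch below checks symmetry and optionally compares a chosen subset of statement qualifiers. The parameter compare_qualifiers is hypothetical (it plays the role of the new constraint parameter suggested above), and qualifiers are simplified to a dictionary with one value per qualifier.

```python
def symmetric_violations(statements, compare_qualifiers=()):
    """statements: list of (subject, object, qualifiers) for one property,
    where qualifiers is a dict such as {"start_time": 1990}.

    A statement is reported if no reverse statement exists whose values
    for the qualifiers named in compare_qualifiers are the same.
    """
    def key(subject, obj, qualifiers):
        picked = tuple((q, qualifiers.get(q)) for q in compare_qualifiers)
        return (subject, obj, picked)

    present = {key(s, o, q) for s, o, q in statements}
    return [(s, o) for s, o, q in statements if key(o, s, q) not in present]


sample = [
    ("A", "B", {"start_time": 1990}), ("B", "A", {"start_time": 1990}),
    ("C", "D", {"start_time": 2001}), ("D", "C", {"start_time": 2002}),
    ("E", "F", {}),
]
plain = symmetric_violations(sample)
strict = symmetric_violations(sample, compare_qualifiers=("start_time",))
```

With an empty compare_qualifiers the check reduces to the basic symmetric constraint; listing qualifiers makes it progressively stricter without requiring the full attribute sets to be identical.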
type constraint (Q21503250): Items with the specified property should have the given type.
Here we have chosen to have a separate formula for each possible value of the relation qualifier (although it could be done with a single formula if desired). Note also that these formulas allow for any number of values for the class qualifier, in keeping with current practice.
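The role of the relation qualifier can be illustrated with a small Python sketch. It assumes that subclass-of links are followed transitively, encodes the instance-of and subclass-of statements as plain dictionaries, and uses made-up class names; none of this is taken from the eMAPL formulas themselves.

```python
def superclasses(cls, subclass_of):
    """cls together with every class reachable from it via subclass_of links."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(subclass_of.get(c, ()))
    return seen


def satisfies_type(item, classes, relation, instance_of, subclass_of):
    """relation is one of "instance of", "subclass of",
    "instance or subclass of"; classes may hold any number of class values."""
    reachable = set()
    if relation in ("instance of", "instance or subclass of"):
        for c in instance_of.get(item, ()):
            reachable |= superclasses(c, subclass_of)
    if relation in ("subclass of", "instance or subclass of"):
        for c in subclass_of.get(item, ()):
            reachable |= superclasses(c, subclass_of)
    return bool(reachable & set(classes))


# Toy data with hypothetical identifiers.
instance_of = {"douglas_adams": ["human"]}
subclass_of = {"human": ["person"], "person": ["agent"]}
ok = satisfies_type("douglas_adams", ["agent"], "instance of",
                    instance_of, subclass_of)
```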
value requires statement constraint (Q21510864): Values for this property should have a certain other statement.
Here, represents the value (object) of this property statement, the value of the other statement.
value type constraint (Q21510865): The referenced item should be a subclass or instance of the given type.
The notes for the type constraint, above, are applicable here also.