Uncovering Hidden Semantics of Set Information in Knowledge Bases

03/06/2020 · by Shrestha Ghosh, et al. · Max Planck Society

Knowledge Bases (KBs) contain a wealth of structured information about entities and predicates. This paper focuses on set-valued predicates, i.e., the relationship between an entity and a set of entities. In KBs, this information is often represented in two formats: (i) via counting predicates such as numberOfChildren and staffSize, that store aggregated integers, and (ii) via enumerating predicates such as parentOf and worksFor, that store individual set memberships. The two formats are typically complementary: unlike enumerating predicates, counting predicates do not give away individuals, but are more likely to be informative about the true set size; this coexistence could thus enable interesting applications in question answering and KB curation.


1 Introduction

Motivation and Problem

Knowledge bases (KBs) like Wikidata Vrandečić (WWW 2012), DBpedia Auer et al. (ISWC 2007), Freebase Bollacker et al. (2008) and YAGO Suchanek et al. (WWW 2007) are important backbones for intelligent applications such as structured search, question answering and dialogue. Properly modelling and understanding the schema of such KBs, and the semantics of their predicates, is a crucial prerequisite for utilizing them. In this paper we focus on set-valued predicates, i.e., predicates which connect entities with sets of entities. Set-valued predicates typically come in two variants: (i) as enumerating predicates, which list individual objects for a given subject, and (ii) as counting predicates, which present total object counts. An example for this is shown in Fig. 1, an excerpt from Wikidata about the former US president Garfield. The predicate child, which lists individual children of Garfield, is an enumerating predicate, while numberOfChildren, which gives a count of Garfield’s children, is a counting predicate, and both model the same phenomenon. Set predicates can also model merely related phenomena; for instance, for a given location, the sets described via numberOfInhabitants and birthPlaceOf typically have a considerable overlap, but do not coincide.

Figure 1: Example of enumerating and counting predicates in Wikidata (https://www.wikidata.org/wiki/Q34597).

Identifying set predicates and set alignments would be an important step towards a better understanding of KB semantics. In particular, set alignments would be beneficial for the following use cases:

  • KB curation: Identifying gaps and inconsistencies in the KB and getting directives for acquiring missing pieces of knowledge (e.g., adding the 3 absent children of US president Garfield to the KB) Mirza et al. (ISWC 2018).

  • Query formulation: Aiding users to formulate proper SPARQL queries by showing them related predicates, e.g., finding people with more than 2 children by computing the union of matches for the counting predicate and results from aggregating the instances of the enumerating predicate (example query for people with children: http://tinyurl.com/y4hemdvc) Calvanese et al. (2017); a sketch of such a query is given after this list.

  • Answer explanation: Exemplifying query results by showing key instances of queries over counting predicates (e.g., showing a few individual Turing Award winners for a query about the number of award winners).
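To make the query-formulation use case concrete, the following is a minimal Python sketch against the public DBpedia SPARQL endpoint; the predicate IRIs dbo:numberOfChildren and dbo:child are illustrative assumptions and may differ from the vocabulary of the actual KB.

import requests

# People with more than 2 children: matches of the counting predicate,
# united with aggregated matches of the enumerating predicate.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?person WHERE {
  { ?person dbo:numberOfChildren ?n . FILTER(?n > 2) }          # counting predicate
  UNION
  { SELECT ?person WHERE { ?person dbo:child ?c }
    GROUP BY ?person HAVING (COUNT(?c) > 2) }                   # aggregated enumerating predicate
}
LIMIT 20
"""

resp = requests.get("https://dbpedia.org/sparql",
                    params={"query": QUERY,
                            "format": "application/sparql-results+json"},
                    timeout=30)
for row in resp.json()["results"]["bindings"]:
    print(row["person"]["value"])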

While there is a rich body of research on ontology alignment and schema matching Rahm and Bernstein (2001); Euzenat and Shvaiko (2007); Shvaiko and Euzenat (2013); Jain et al. (ISWC 2010); Suchanek et al. (2011); Wang et al. (2013); Niepert et al. (AAAI 2010); Boldyrev et al. (2018), these works typically focus on identifying perfectly matching pairs of predicates with the same or largely overlapping values. This situation differs from our setting, where the integer values of counting predicates and the cardinalities of enumerating predicates modelling the same or related phenomenon rarely match perfectly. Properly identifying set predicates and set alignments in knowledge bases is also difficult for other reasons: (i) KBs contain a large number of predicates, often with uninformative names and without coherent type signatures, thus making the identification of set predicates and their alignments challenging. (ii) Enumerating predicates are often incomplete (like Garfield’s children) and counting predicates may be approximate estimates only (like number of inhabitants); so cardinalities do not match count values, yet the predicates should be linked in order to couple them for future KB completion, consistency assessment and the other use cases listed above.

Approach and Contribution

This paper presents CounQER (for “Counting Quantifiers and Entity-valued PRedicates”), the first comprehensive methodology towards identifying and linking set predicates in KBs. CounQER is judiciously designed to identify set predicates in noisy and incomplete web-scale KBs such as Wikidata, DBpedia and Freebase. It operates in two stages: in the first stage, supervised classification combining linguistic and statistical features is used to identify enumerating and counting predicates. In the second stage, a set of statistical co-occurrence and correlation measures is used in order to link the set predicates.

Our salient original contributions are:

  1. We introduce the notion of set predicates and its variants, and highlight the benefits that can be derived from identifying their alignments.

  2. We present a two-stage methodology for (i) predicting the counting and enumerating predicates via supervised classification and, (ii) ranking set predicates of one variant aligned to the other variant via statistical and lexical metrics.

  3. We demonstrate the practical viability of our approach by extensive experiments on four KBs: Wikidata, Freebase, and two variants of DBpedia (raw and mapping-based).

  4. We publish the results of our alignment methodology for these KBs at https://tinyurl.com/y2ka4kfu, comprising 264 alignments from the DBpedia mapping-based KB, 3703 alignments from the DBpedia raw KB, 25 alignments from the Wikidata-truthy KB and 274 alignments from the Freebase KB, as well as an interactive demo QA system at https://counqer.mpi-inf.mpg.de.

2 Related Work

Schema and ontology alignment

Schema alignment is a classic problem in data integration Rahm and Bernstein (2001). For ontologies and on the semantic web, added complexity comes from taxonomies and ontological constraints Euzenat and Shvaiko (2007); Shvaiko and Euzenat (2013). Approaches to ontology alignment include BLOOMS Jain et al. (ISWC 2010) and PARIS Suchanek et al. (2011), voting-based aggregation Wang et al. (2013), probabilistic frameworks Niepert et al. (AAAI 2010), or methods for the alignment of multicultural data Boldyrev et al. (2018). These methods typically rely on a combination of lexical, structural, constraint and instance based information. Some works have also investigated subset relations Koutraki et al. (2017), yet still focusing only on entity-entity relations. The most important venue in the field of ontology alignment is the long-running Ontology Matching workshop series Shvaiko et al. (2018) along with its attached challenges Algergawy and others (2018). Our setting, where enumerations need to be aligned with counts, is atypical in ontology alignment and has not received prior attention.

Set information in logics and KBs

Modelling count information has a history in qualifying number restrictions Hollunder and Baader (1991) and role restrictions in description logics Calvanese et al. (1998). In the OWL standard McGuinness et al. (2004), count information on relations can be expressed via cardinality assertions. The second statement in Fig. 1, for instance, could be expressed as

ClassAssertion(ObjectExactCardinality(7 :child) :Garfield)

OWL furthermore also supports lower bounds, e.g., that a certain person has at least two children, and upper bounds, e.g., that a certain car has at most five seats. The added expressiveness from counting quantifiers typically comes at a complexity tradeoff, no matter whether they are part of the ontology language Glimm et al. (2008); Calvanese et al. (AAAI 2009), or only of the query language Fan et al. (SIGMOD 2016), especially as they introduce negation (“Companies that are owned by zero other companies”).
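For illustration, such lower and upper bounds can be written in the same functional syntax as the example above (the individuals and the :seat property are hypothetical):

ClassAssertion(ObjectMinCardinality(2 :child) :SomePerson)
ClassAssertion(ObjectMaxCardinality(5 :seat) :SomeCar)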

Numeric and set information in KBs, QA and IE

Popular knowledge bases contain considerable numeric information. Research has focused on detecting errors and outliers in such information Wienand and Paulheim (ESWC 2014), and on organizing and annotating measurement units Neumaier et al. (ISWC 2016); Subercaze (ESWC 2017).

Set information is important for question answering. For instance, in Mirza et al. (ISWC 2018) it is reported that between 5% and 10% of questions in popular TREC QA datasets concern counts. This information need is acknowledged by QA systems. AQQU Bast and Haussmann (CIKM 2015), for instance, includes a special translation for questions starting with “How many?”. The Google search engine similarly answers count keyword queries like “How many children Angelina Jolie” with counts (“6”).

When concerned with numeric information, textual information extraction traditionally focused on temporal information Ling and Weld (AAAI 2010) and measures Saha et al. (ACL 2017). Recently also counting information extraction has received attention, e.g., from sentences like “The LoTR series consists of three books” Mirza et al. (ACL 2017, ISWC 2018). Such information can then be used to assess and improve KB completeness Razniewski et al. (2016).

3 Problem Definition

Let P be a set of predicates. A knowledge base (KB) is a set of triples (s, p, o), where p ∈ P, s is an entity, and o is either an entity or a literal. For the remainder of this paper we assume that each triple (s, p, o) with an entity as object also exists in its inverse form (o, p⁻¹, s) in each KB, thus the following elaborations need to consider only one direction.

The foundational concept for this work is that of a set predicate.

Definition 1 (Set Predicate). (We emphasize that set predicate refers to the logical structure modelled by the predicate, not to be mixed with the technical implementation in RDF, which does not know a set datatype but can capture sets via multiple triples sharing subject and predicate.)

A predicate which conceptually models the relation between an entity and a set of entities is a set predicate.

Set predicates can be expressed in KBs in two variants: Via binary predicates that enumerate individual set members, and via counting predicates that abstract away from individuals, and store aggregate counts only.

Definition 2 (Enumerating Predicate).

A set predicate that models sets via binary membership statements is called an enumerating predicate.

We denote the set of enumerating predicates as P_e. Assuming a KB were complete, contained no erroneous triples and no hierarchical redundancy, enumerating predicates could be retrieved by appropriately thresholding predicates by their functionality Suchanek et al. (2011), i.e., by the number of objects they take per subject.

Definition 3 (Counting Predicate).

A set predicate whose values predominantly represent counts of conceptual entities is called a counting predicate.

Entity counts necessarily are integers, yet KB predicates can contain integers that represent a variety of other concepts, for instance IDs or measures. Assuming a hypothetical entity-count datatype, we could define counting predicates by thresholding on the frequency of that datatype among a predicate's values.

Following these definitions, the predicates child and numberOfChildren in Fig. 1 are set predicates, the former an enumerating predicate, the latter a counting predicate. Other examples are worksAt and authorOf, which frequently enough take several values for a subject. This is in contrast to predicates of a functional or quasi-functional nature, such as bornIn and mother, which predominantly take a single object, and hence, where counts are uncommon and rarely informative. Yet, there exist also predominantly functional predicates like citizenOf, which nevertheless for some entities (who have multiple citizenships) take multiple values, and hence are enumerating predicates.

Other examples of counting predicates are population, numberOfStudents and airlineDestinations. The distinction between counting predicates and measurement predicates like riverLength and revenue is quite crisp, since measurements usually come with units (km, €, etc.) and can take fractional values, while entity counts cannot. Our definition is phrased to also exclude some predicates taking integer values, like trikotNumber (not a count, because trikot numbers can be assigned arbitrarily) and floorCount (a count, but not of something commonly considered as entities). Thus, integer values are a necessary but not a sufficient condition for being a counting predicate.

We summarize our first problem as follows.

Problem 1 (Set predicate identification).

Given a KB with a predicate set P, identify the set of enumerating predicates, P_e ⊆ P, and the set of counting predicates, P_c ⊆ P.

Note that the above definitions are conceptual only. Functionalities computed over actual KBs are unreliable due to incompleteness, errors, and redundancies, and common KBs do not have an entity-count datatype. Thus, in later sections we will develop supervised classifiers for identifying both kinds of set predicates.

Let us now turn to the relation between set predicates. Enumerating and counting predicates can be set-related in two ways:

  1. Exact semantic matching: An enumerating predicate p_e and a counting predicate p_c are called exact matching if, for every subject s, in reality, the objects connected to s via p_e are exactly those counted by the value of p_c for s.

An example of exact matching predicates are child and numberOfChildren, which conceptually both refer to the same entities, once by listing their names in separate triples, once by storing the aggregate count.

  2. Set-overlapping: An enumerating predicate p_e and a counting predicate p_c are called set-overlapping if, aggregated across subjects s, the overlap between the objects of p_e and the objects counted by p_c is significantly above a random overlap.

Examples of set-overlapping predicates are population and birthPlaceOf, whose entity sets typically show a significant overlap, but do not coincide (many people live in the same place they were born in, though neither entails the other). In turn, population and the inverse of headquarterLocation are not set-related. Although population sizes and the number of company headquarters in a place are correlated numbers, the described entities do not overlap at all; they are even of distinct types (person and company).

Problem 2 (Set predicate alignment).

Given sets of enumerating predicates P_e and counting predicates P_c, for each set predicate p ∈ P_e ∪ P_c, rank the predicates from the other set by their set-relatedness.

Note that the above definitions of set-relatedness are conceptual definitions. In practice, KBs do not give access to the entities counted by counting predicates, instead one only sees aggregate counts. To quantify and qualify set-relatedness, in the following sections, we will thus build a set of unsupervised alignment heuristics.

4 Design Space and Architecture

Design space

Our goal is to develop a robust set predicate identification and linking methodology that, with limited supervision, can work across different KBs.

If knowledge bases were clean, set predicate identification could solely rely on relation functionality and datatypes. As this is not the case in practice Wu et al. (2014); Wienand and Paulheim (ESWC 2014); Zaveri et al. (2016), we instead propose to approach set predicate identification via a supervised classification framework that combines a diverse set of textual, datatype, and statistical features. Schema alignment can also in principle be approached via hand-crafted rules, heuristic alignment metrics, or supervised learning. Due to the particularities of individual predicates and KBs (most set predicates have only very few good alignments), and to avoid overfitting, we opt here for a set of heuristic alignment metrics, which we design in order to capture various desiderata of meaningful alignments, and combine them in an ensemble metric.

Architecture

Following the above considerations, we split our CounQER methodology in two steps: (i) supervised predicate classification, and, (ii) heuristic predicate alignment (see Fig. 2).

Figure 2: Architecture of the CounQER approach.

In the first phase, supervised predicate classification, we use two classifiers to predict the two set predicate variants, namely, enumerating and counting predicates. We rely on a family of features, most importantly, (i) set-related textual features, extracted from a background corpus, (ii) type information about the domain and range of the predicates, and (iii) statistical features about the number of objects per subject at different percentiles.

In the second phase, heuristic predicate alignment, we identify related counting and enumerating predicates using (i) set predicate co-occurrence information, (ii) set predicate value distribution, and (iii) linguistic relatedness. By assigning each pair of enumerating and counting predicates a relatedness score, we can rank related predicates accordingly. While we evaluate the heuristics on labelled data, they are highly complementary, and thus the choice of the heuristic to be used can be adapted to particular use cases.

KB assumptions

Our approach is designed to work on a variety of knowledge bases, without requiring strong assumptions on their internal structure. Fulfilment of the following features is desirable, though not essential: (i) High-level categories/classes for entities, in particular Person, Place, Organisation, Event and Work. Where these are not available, we utilize links to Wikidata to extract them. (ii) High-level datatypes, in particular float, int and date. Where these are not available, we utilize standard parsers and simple heuristics, such as that numbers between 1900 and 2020 are likely dates. (iii) human-readable labels for properties, with spaces or in camel case notation. Where these are not available, we deactivate corresponding linguistic features.
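As an illustration of the datatype fallback, the following minimal Python sketch guesses a high-level datatype for a raw literal; the parsing rules are simplified assumptions, only the 1900–2020 year heuristic is taken from the description above.

from datetime import datetime

def guess_datatype(value: str) -> str:
    """Heuristically map a raw literal to a high-level datatype."""
    v = value.strip()
    # try explicit date formats first
    for fmt in ("%Y-%m-%d", "%d %B %Y"):
        try:
            datetime.strptime(v, fmt)
            return "date"
        except ValueError:
            pass
    try:
        n = int(v)
        # plain integers in a plausible year range are likely dates
        return "date" if 1900 <= n <= 2020 else "int"
    except ValueError:
        pass
    try:
        float(v)
        return "float"
    except ValueError:
        return "string"

assert guess_datatype("1975") == "date"
assert guess_datatype("42") == "int"
assert guess_datatype("3.14") == "float"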

5 Set Predicate Identification

5.1 Enumerating Predicates

As stated in Section 3, if KBs were clean, functionality (#triples per subject) would be the criterion for identifying enumerating predicates.

Yet actual KBs contain a considerable amount of noise, are incomplete, and blur functionality by redundancies (e.g., listing both the birth city and country of a person under birthPlace). In CounQER, we thus rely on supervised classification, where functionality is only one of several features towards enumerating predicate identification.

a. Textual features

Where KBs use human-readable predicate names, a basic sanity check for enumerating predicates is to verify whether in human language, the predicate name is used both in singular and plural.

  1. Plural-singular ratio: For each predicate, we apply a heuristic to generate its plural/singular form. First we identify the last noun in the predicate label using the Python nltk package, and then we use the Python inflect library to identify its form (singular/plural) and convert it to the other (plural/singular). We then compute the text frequency ratio based on the Bing API; for birthplace, for instance, the ratio is only 0.08, whereas predicates whose labels are frequently used in the plural obtain higher ratios. A minimal sketch of this feature is given below.
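The following minimal sketch shows the plural/singular derivation with nltk and inflect as named above; the label is assumed to be already split into space-separated words, and the web-frequency lookup is left as a stub, since the Bing custom search call and its response format depend on the subscription.

import inflect
import nltk  # requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

infl = inflect.engine()

def plural_singular_forms(label: str):
    """Return (singular, plural) of the last noun in a predicate label."""
    tokens = nltk.word_tokenize(label)
    nouns = [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
    noun = nouns[-1] if nouns else tokens[-1]
    singular = infl.singular_noun(noun) or noun   # False means it was singular already
    plural = infl.plural_noun(singular)
    return singular, plural

def web_frequency(term: str) -> int:
    """Stub for the estimated match count of a web search API (e.g., Bing)."""
    raise NotImplementedError

def plural_singular_ratio(label: str) -> float:
    singular, plural = plural_singular_forms(label)
    return web_frequency(plural) / max(web_frequency(singular), 1)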

b. Type information

Certain types of objects may more naturally be counted than others, and certain subjects may more frequently come with set predicates than others.

  1. Predicate domain: We consider five frequent and general classes, {Person, Place, Organisation, Event and Work}, including their subclasses to capture the domain information of a predicate. We encode the most frequent class via binary features per class, including a 6th class for other.

  2. Predicate range:

    We encode the class of predicate range in binary variables (same as in predicate domain).

c. KB statistics

KB statistics instantiate the observed functionality. As functionality may be blurred by outliers or a long tail of single-valued subjects, we input various datapoints in order to increase resilience of the measure. We also include basic information on datatypes.

  1. Mean, maximum, minimum, 10th and 90th percentile number of objects per subject (functionality): These features describe the number of objects a predicate takes per subject, with mean and percentiles giving resilience against rare outliers. For example, occupation in the Wikidata-truthy KB has a maximum of 30 objects per subject, a minimum of 1, a 10th percentile of 1 and a 90th percentile of 2. The predicate placeOfBirth in the Wikidata-truthy KB has a maximum of 6 objects per subject and 1 object per subject for the other statistics (minimum, mean, 10th and 90th percentile).

  2. Datatype distribution: The fraction of triples of a predicate whose objects are of datatype entity or a string with comma-separation. For instance, both the predicates occupation and placeOfBirth take entities for virtually all of their triples in the Wikidata-truthy KB.

These features are then used for a binary classifier.

5.2 Counting Predicates

As per our conceptual definition, counting predicates are distinguished by having entity-count as their datatype. As none of the KBs investigated in this paper records such a datatype, we have to use various heuristics towards identifying counting predicates. Integer values are an important necessary condition, yet they alone are not sufficient. We utilize the following classes of features.

a. Textual features

  1. Plural-singular ratio: This feature captures the plural/singular ratio of a predicate obtained exactly as for enumerating predicates.

b. Type information

  1. Predicate domain: We identify the domain of the predicates by tracing the class of the predicates to one of the most general classes in the type hierarchy, {Place, Person, Organization, Event, Work}. Each domain class is encoded as a binary variable in the classifier.

c. KB statistics

  1. Datatype distribution: We calculate the fraction of triples of a predicate taking integer values over the total number of triples of that predicate. For instance, the predicate numberOfEpisodes in the DBpedia mapping-based KB takes only integer values, whereas episodeNumber in the DBpedia raw KB takes integer values for only part of its triples.

  2. Mean, maximum, minimum, 10th and 90th percentile of count value: These features describe the actual integer value of the predicate, e.g., the mean for numberOfEpisodes (DBpedia mapping-based KB) is 106, the maximum is 90015, the minimum is 0, the 10th percentile is 6 and the 90th percentile is 156.

  3. Mean, maximum, minimum, 10th and 90th percentile of the number of objects per subject (functionality): These features describe the number of integer valued triples per subject.

    For example, the mean number of numberOfEpisodes facts (DBpedia mapping-based KB) a subject takes is 1, the maximum is 8, and the minimum, the 10th percentile and the 90th percentile are all 1, i.e., most subjects have only one fact containing this predicate. In contrast, an ordinal integer predicate like episodeNumber (DBpedia raw KB) has the following statistics: mean 32, maximum 975, minimum 1, 10th percentile 6 and 90th percentile 66. This odd behaviour is exhibited because the article page lists all or a subset of the episode numbers in a series (DBpedia subjects with counts of episodeNumber facts: https://tinyurl.com/dbpedia-raw-episodenumber).

6 Heuristic Predicate Alignment

The output of the previous stage are the enumerating predicates P_e and the counting predicates P_c. The task of this stage is to find, for each predicate in P_e ∪ P_c, the most set-related predicates from the other set. As this task may to some extent be KB-specific, we approach it via a set of unsupervised ranking metrics. We introduce three families of metrics for predicate pairs (p_e, p_c): (a) set predicate co-occurrence, based on the number of subjects for which p_e and p_c co-occur, (b) set predicate value distribution, based on the relation between the number of objects of p_e and the value of p_c for co-occurring subjects, and (c) set predicate linguistic similarity, which measures the relatedness between the labels of the set predicates p_e and p_c.

A. Set Predicate Co-occurrence

Our first family of heuristics ranks predicates by their co-occurrence. Co-occurrence is an indication towards topical relatedness, and we propose various measures that capture absolute and relative co-occurrence frequencies.

  1. Absolute: The number of subjects which have triples with both the p_e and the p_c set predicates.

  2. Jaccard: The ratio of the absolute number of subjects for which p_e and p_c co-occur to the number of subjects which take either p_e or p_c or both.

  3. Conditional: Co-occurrence can also be expressed as a conditional probability, i.e., the ratio of the absolute value to the number of subjects which take only p_e, or only p_c. For the pair producer and numberOfEpisodes, for instance, the conditional score with respect to subjects taking numberOfEpisodes is higher than the one with respect to subjects taking producer, which implies that if a given subject has the predicate numberOfEpisodes, it is more likely that the subject also has the predicate producer than the other way around.

  4. P'wiseMI (pointwise mutual information): The log of the ratio of the joint distribution of p_e and p_c to the product of their individual distributions; a negative value implies that the two predicates are less likely to co-occur than their individual occurrences would suggest. The lower bound of this metric is reached when the pair does not co-occur for any subject and the upper bound is reached when p_e always co-occurs with p_c or vice-versa. A minimal sketch of these four scores is given below.
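The four co-occurrence scores can be computed directly from the sets of subjects of the two predicates; a minimal sketch, assuming subj_e and subj_c hold the subjects of p_e and p_c and n_subjects is the total number of subjects in the KB:

import math

def cooccurrence_scores(subj_e: set, subj_c: set, n_subjects: int) -> dict:
    """Absolute, Jaccard, conditional and pointwise-mutual-information scores."""
    both = len(subj_e & subj_c)
    union = len(subj_e | subj_c)
    p_e = len(subj_e) / n_subjects
    p_c = len(subj_c) / n_subjects
    p_joint = both / n_subjects
    return {
        "absolute": both,
        "jaccard": both / union if union else 0.0,
        "cond_given_e": both / len(subj_e) if subj_e else 0.0,
        "cond_given_c": both / len(subj_c) if subj_c else 0.0,
        "pmi": math.log(p_joint / (p_e * p_c)) if p_joint > 0 else float("-inf"),
    }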

B. Set Predicate Value Distribution

Co-occurrence is important but can nonetheless be spurious, e.g., when many sports teams have both the predicates stadiumSize and coachOf. A possibly even stronger indicator for set-relatedness is a match or correlation in values, i.e., if across subjects, the number of values for the enumerating predicate and the count stored in the counting predicate coincide, or correlate. We propose three variants: counting the number of exact matches (P'fectMR), and two relaxed metrics that look for correlation and percentile similarity.

  1. P'fectMR (perfect match ratio): The ratio of subjects where the number of objects of p_e exactly matches the value of p_c to the number of subjects which take both the p_e and p_c predicates.

  2. Correlation: The Pearson correlation between the number of objects of p_e and the value of p_c across all subjects for which they co-occur.

  3. P'tileVM (percentile value match): A softer score than the perfect match ratio, which compares the q-th percentile value of the number of objects that p_e takes per subject with the q-th percentile value of p_c, such that the closer the score is to 1 the better the alignment. A sketch of these three scores is given below.
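A minimal sketch of the three value-distribution scores over the co-occurring subjects; counts holds the counting-predicate values and n_objects the number of enumerated objects per shared subject, and the percentile-match formula (ratio of the two percentile values) is an assumption, since the exact definition is not reproduced above:

import numpy as np
from scipy.stats import pearsonr

def value_distribution_scores(counts, n_objects, q=50):
    """Perfect-match ratio, Pearson correlation and q-th percentile value match."""
    counts = np.asarray(counts, dtype=float)
    n_objects = np.asarray(n_objects, dtype=float)
    perfect_mr = float(np.mean(counts == n_objects))
    corr = float(pearsonr(n_objects, counts)[0]) if len(counts) > 1 else 0.0
    qc, qe = np.percentile(counts, q), np.percentile(n_objects, q)
    ptile_vm = min(qc, qe) / max(qc, qe) if max(qc, qe) > 0 else 0.0
    return {"p_fect_mr": perfect_mr, "correlation": corr, "p_tile_vm": ptile_vm}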

C. Linguistic Similarity

Besides co-occurrence, correlations too can be spurious. For instance, population and headquarterLocation are well correlated (bigger cities host more companies), but nonetheless they refer to completely different kinds of entities (persons vs. companies). Our third family of heuristics thus looks at topical relatedness.

  1. CosineSim: Measures the cosine of the angle between the averaged word vectors of the labels of p_e and p_c, obtained from pre-trained GloVe embeddings Pennington et al. (EMNLP 2014) using the Python Gensim library. Wikidata predicate labels are already individual words; for DBpedia and Freebase we split the predicates at capitalization and punctuation, respectively (e.g., numberOfStudents becomes {number, of, students} and race_count becomes {race, count}). Out-of-vocabulary words are removed, and empty word lists get assigned similarity zero. A sketch of this computation is given below.
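A minimal sketch of this label similarity, using gensim's downloader for pre-trained GloVe vectors; the 50-dimensional Wikipedia/Gigaword model is an assumption, as the description above only names GloVe embeddings:

import re
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")      # pre-trained GloVe word vectors

def label_tokens(label: str):
    """Split camelCase / snake_case predicate labels into lowercase words."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", label).replace("_", " ")
    return [w.lower() for w in spaced.split()]

def label_vector(label: str):
    vecs = [glove[w] for w in label_tokens(label) if w in glove]
    return np.mean(vecs, axis=0) if vecs else None

def cosine_sim(label_e: str, label_c: str) -> float:
    v1, v2 = label_vector(label_e), label_vector(label_c)
    if v1 is None or v2 is None:
        return 0.0        # out-of-vocabulary labels get similarity zero
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))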

7 Experiments

7.1 KBs used

We use four popular general purpose KBs: (i) DBpedia raw extraction Auer et al. (ISWC 2007), (ii) DBpedia mapping-based extraction Lehmann et al. (2015), (iii) Wikidata truthy Vrandečić (WWW 2012) and (iv) Freebase Bollacker et al. (2008). We analyze each KB in terms of predicate coverage.

  1. DBpedia raw (52.6M triples): All predicate-value pairs present in the infoboxes of English Wikipedia article pages.

  2. DBpedia mapping-based (29M triples): A cleaner infobox dataset where predicates were manually mapped to a human-generated ontology. Unmapped predicates and type-violating triples are discarded.

  3. Wikidata truthy (210.3M triples): Simple triple export of Wikidata that ignores some advanced features such as qualifiers and deprecated ranks.

  4. Freebase (1B triples): The tuple store available as an RDF dump at https://developers.google.com/freebase.

We also analysed YAGO Suchanek et al. (WWW 2007) (1.1B triples), a WordNet-aligned and sanitized harvest of Wikipedia infobox statements, containing only 76 distinct predicates. By manual inspection we found several enumerating predicates, like hasChild and isCitizenOf, but only one counting predicate, numberOfPeople, and therefore refrained from further processing of this KB.

KB All Frequent
DBP-raw 73,234 16,635
DBP-map 2,008 1,670
WD-truthy 6,111 4,067
Freebase 799,807 13,872
YAGO (79) (79)
Table 1: Total number of KB predicates (direct + inverse) and most frequent ones.

On adding inverse triples, i.e., adding (o, p⁻¹, s) for every (s, p, o) where o is an entity, the size of DBpedia-raw increased by 7.6M triples, DBpedia-map by 18M, Wikidata by 101.1M and Freebase by 442.1M.

To reduce noisy data we use predicates which appear in at least 50 triples. In Table 1 we show the number of predicates that remain after filtering all infrequent predicates. It is evident that the cleaner KBs like Wikidata and DBpedia mapping-based KB have better predicate representation. Freebase and DBpedia raw KBs are noisier with a very long tail of less frequently occurring predicates.

7.2 Preprocessing

Predicate statistics computation

Given a KB of SPO triples, we generate the descriptive statistics of the KB predicates, including (i) the datatype distribution, i.e., the fraction of the triples of a predicate which take integer, float, date, entity and comma-separated string values, (ii) the mean, maximum, minimum, 10th and 90th percentile of the integer values that a predicate takes, (iii) the mean, maximum, minimum, 10th and 90th percentile of the number of entities per subject of a predicate and, (iv) the mean, maximum, minimum, 10th and 90th percentile of the number of integer values per subject of a predicate.

Type information

We then proceed to find the predicate domain and range. To maintain uniformity across KBs we trace the type to one of the more general classes in the type hierarchy, {Place, Person, Organization, Event, Work}, with the default fallback domain being Thing and range being Literal. We sampled subjects and objects for each predicate and selected the majority class in each set as the domain and range of the predicate.
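A minimal sketch of the majority-class assignment over sampled entities; entity_class is a hypothetical helper that maps an entity to one of the five general classes (or None if unknown):

import random
from collections import Counter

GENERAL_CLASSES = {"Place", "Person", "Organization", "Event", "Work"}

def majority_class(entities, entity_class, sample_size=100, default="Thing"):
    """Sample entities and return the most frequent general class among them."""
    sample = random.sample(list(entities), min(sample_size, len(entities)))
    counts = Counter(c for c in map(entity_class, sample) if c in GENERAL_CLASSES)
    return counts.most_common(1)[0][0] if counts else default

# domain  = majority_class(subjects_of_predicate, entity_class)
# range_  = majority_class(objects_of_predicate, entity_class, default="Literal")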

Textual features

The frequency of occurrence of a predicate on the web in singular and plural form is determined from the total estimated web search matches returned by the Bing custom search API (https://azure.microsoft.com/en-us/services/cognitive-services/bing-custom-search/).

7.3 Training and evaluation data

We prepared the data for the classification step by employing crowd workers to annotate randomly selected predicate sets of 400 candidates each for enumerating and for counting predicates from the four KBs, taking 100 from each KB. The annotation task comprised a predicate and five sample subject-object pairs, with options to select if the predicate was likely a set predicate (enumerating or counting).

An example question for annotating counting predicates is given below.

Q: Based on the following facts answer whether the relation gives a count of unique entities.

The Herald (Sharon) circulation 15715
H.O.W. Journal circulation 4000
L’Officiel circulation 101719
The Music Scene (magazine) circulation 25000
Pipe Dream (newspaper) circulation 7000

Options: Yes, Maybe yes, Maybe no, No, Do not know

For enumerating predicate annotation we used the following question format.

Q: Based on the following facts answer whether the relation enumerates entities.

A Low Down Dirty Shame producer Mike Chapman
Bye Bye Brazil producer Luiz Carlos Barreto
Heaven Knows, Mr. Allison producer Eugene Frenke
Surviving Paradise producer Kamshad Kooshan
I’ll Come Running Back to You producer Bumps Blackwell

Options: Yes, Maybe yes, Maybe no, No, Do not know

We collected three judgements per predicate, i.e., a total of 2,400 annotations (2 variants of set predicates × 4 KBs × 100 predicates × 3 judgments). The options in the annotation task are graded, with weights {Yes: 1, Maybe yes: 0.75, Do not know: 0.5, Maybe no: 0.25, No: 0}, such that the final label is the weighted average over all judgements, lying between 0 and 1. We remove annotations with divided agreement, i.e., all annotations whose final score lies in an intermediate range around 0.5. A binary label of 1 is assigned to a set predicate if its score lies above this range, else it is labelled 0. The judgements for the counting predicate annotations and for the enumerating predicate annotations were largely in agreement. Thus, we obtained our training data, with 39 positive and 306 negative data points for the counting classifier and 133 positive and 195 negative data points for the enumerating classifier.
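A minimal sketch of turning the graded judgements into binary training labels with the weights above; the exact cut-offs for discarding divided annotations are assumptions, as the thresholds are not reproduced here:

WEIGHTS = {"Yes": 1.0, "Maybe yes": 0.75, "Do not know": 0.5,
           "Maybe no": 0.25, "No": 0.0}

def binary_label(judgements, low=0.25, high=0.75):
    """Average the graded judgements; 1/0 for clear cases, None for divided
    agreement (such predicates are dropped from the training data)."""
    score = sum(WEIGHTS[j] for j in judgements) / len(judgements)
    if score >= high:
        return 1
    if score <= low:
        return 0
    return None   # divided agreement, discard

assert binary_label(["Yes", "Yes", "Maybe yes"]) == 1
assert binary_label(["No", "Maybe no", "No"]) == 0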

For the alignment step, evaluation data was prepared by collecting relevance judgements from crowd workers. We randomly chose enumerating and counting predicates from those returned by our classifiers. We then created the set of top-3 counting predicates returned by all the alignment heuristics for each enumerating predicate, so that for each enumerating predicate we had up to 24 counting predicates as candidates. We repeated the step with the counting predicates, this time returning the top-3 enumerating predicates for each counting predicate. On average, there were 5 candidates for each set predicate in the enumerating and counting case. The annotation task asked each worker to judge the topical relatedness of a pair of set predicates (an enumerating and a counting predicate) and the degree of completeness, based on the integer value of the counting predicate and the entities covered by the enumerating predicate with respect to a subject. An example task where the system returns a counting predicate is as follows.

Subject Predicate Object
Query
Univ. of California, L.A. institution Thomas Sowell, Harold Demsetz ... (5 in total)
Result
Univ. of California, L.A. faculty size 4016

We ask the following two questions.
1. Topical relatedness of institution to faculty size is:
Options: High, Moderate, Low, None.
2. Enumeration of the objects in the query is:
Options: Complete, Incomplete, Unrelated.

The task in the opposite direction is designed in a similar fashion with the query containing a counting fact and the result, an enumerating fact with the set of objects.

For this task also we collected three judgements per predicate pair in either direction. We use a graded relevance system by calculating a mean score of the two responses where the grades for topical relatedness are {High: 1, Moderate: 0.67, Low: 0.33, None: 0} and for the completeness of enumeration we have {Complete: 1, Incomplete: 0.5, Unrelated: 0}. Thus the graded relevance score (1 being the highest and 0 being the lowest) is calculated by mapping the responses to their grades and averaging over all responses.

7.4 Classifier models

We model our classifiers on logistic regression as well as neural networks. However, due to the small dataset size and our interest in interpretable insights, we focus on multiple logistic regression models. We consider a standard logistic regression model, a logistic regression model with a weakly informative default prior Gelman et al. (2008), a Lasso regularized logistic regression Tibshirani (1996) and a neural network composed of a hidden layer of size three and sigmoid activation function. Due to the small training set we use Leave-One-Out cross validation to obtain our model performance scores.

All models are compared against a random baseline modelled on the input distribution, i.e., predicting labels at random, with probabilities proportional to label frequency in the training data.
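A minimal sketch of the Lasso-regularized (L1) logistic regression evaluated with leave-one-out cross-validation, using scikit-learn; the feature matrix X and label vector y are assumed to come from the features described in Section 5:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import LeaveOneOut

def evaluate_lasso_logreg(X, y, C=1.0):
    """Leave-one-out evaluation of an L1-regularized logistic regression."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    preds = np.empty_like(y)
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[train_idx], y[train_idx])
        preds[test_idx] = clf.predict(X[test_idx])
    p, r, f1, _ = precision_recall_fscore_support(y, preds, average="binary")
    return {"precision": p, "recall": r, "f1": f1}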

Model Recall Precision F1
Random 12.8 12.8 12.8
Logistic 51.2 19.0 27.7
Prior 48.7 20.2 28.5
Lasso 71.7 23.3 35.1
Neural 35.8 20.8 26.3
Table 2: Performance (recall, precision and F1) of the counting predicate classifier models.
Model Recall Precision F1
Random 40.6 40.6 40.6
Logistic 55.6 51.7 53.5
Prior 55.6 51.0 53.5
Lasso 51.1 59.6 55.0
Neural 53.0 49.6 51.2
Table 3: Performance (recall, precision and F1) of the enumerating predicate classifier models.

7.5 Results

a. Classifier model selection

The results of the classifier models are in Tables 2 and 3. As one can see, the Lasso regularized model performs best for counting predicates with an F1 score of 35.1, which is significantly better than the random model with an F1 score of 12.8. We observe that the counting classifier models in general have lower precision scores, but higher recall. The scores of the random model are computed from the training data distribution of counting predicates, which contains 39 positive and 306 negative datapoints. Note that this number of datapoints (345) is less than the initial selection of 400 predicates since, as explained in the previous section, we remove datapoints with divided agreement. We use the Lasso regularized model to classify the counting predicates.

In the enumerating predicate scenario too, the Lasso regularized model has the overall highest performance with an F1 score of 55.0. Here too, the random classifier performance depends on the distribution of the training data, which has 133 positive and 195 negative datapoints, giving an F1 score of 40.6. We use the Lasso regularized model for predicting the enumerating predicates. The comparable recall and precision scores of the enumerating classifier models can be attributed to the almost equal class distribution in the training data, which is not the case for the counting classifiers.

KB Positive samples Negative Samples
DBP-raw 16 62
DBP-map 9 72
WD-truthy 7 87
Freebase 7 85
Total 39 306
Table 4: Distribution of counting classifier training samples across KBs.
KB Positive samples Negative Samples
DBP-raw 33 53
DBP-map 27 58
WD-truthy 27 55
Freebase 46 29
Total 133 195
Table 5: Distribution of enumerating classifier training samples across KBs.

We can conclude from Table 4 that in general, KBs have very few counting predicates which also contributes to the low precision score of the counting classifier. From the numbers in Table 5, we observe that enumerating predicates have a rather balanced distribution.

b. Important Features

The most important features in the counting predicate classifier are the mean and the 10th percentile of the count values of a predicate, both with negative weights, suggesting that counting predicates usually take smaller integer values. The predicate domain of type Organization has a positive weight.

The determining features of the enumerating classifier are the type information on the predicate domain and range. For example, the weights for domain Thing and range Organization are positive. It is interesting to note that predicate ranges of type Work and Place have small negative weights, suggesting that predicates with a location-type range are less likely to be enumerating predicates.

KB Input Output Filtered
DBP-raw 16,635 4,090 4,090 (24.5%)
DBP-map 1,670 308 308 (18.4%)
WD-truthy 4,067 216 203 (4.9%)
Freebase 13,872 7,752 7,614 (54.8%)
Total 36,244 12,366 12,215 (33.7%)
Table 6: Predicted enumerating predicates across different KBs, where Input is all KB predicates (direct + inverse), Output is from the classifier prediction and Filtered the number of predicates remaining after removing predicates related to IDs and codes.
KB Input Output Filtered
DBP-raw 13,394 5,853 5,853 (43.6%)
DBP-map 1,127 898 898 (79.6%)
WD-truthy 3,346 1,922 1,067 (31.8%)
Freebase 8,289 1,723 1,687 (20.3%)
Total 26,156 10,396 9,505 (36.3%)
Table 7: Predicted counting predicates across different KBs, where Input is KB predicates (direct only), Output is from the classifier prediction and Filtered the number of predicates remaining after removing predicates related to IDs and codes.
KB Counting Predicates
DBP-raw employees, retiredNumbers, crewMembers, postgraduates, members
DBP-map numberOfStudents, facultySize, numberOfGoals, populationAsOf, capacity
WD-truthy employees, numberOfDeaths, numberOfConstituencies, numberOfSeats
FB children, numberOfMembers, population, numberOfStaff, injuries, passengers
Wrong Predictions
DBP-raw linecolor, km, birthyear,
DBP-map foundingYear, keyPerson
WD-truthy publicationDate, coordinateLocation
FB maxLength, height
Table 8: Example predicted counting predicates from the different KBs.

c. Predicted Set Predicates

The number of set predicates predicted by each classifier is shown in Tables 6 and 7 in the Output column. The counting classifier predicts 10,396 out of 26,156 predicates as counting predicates. The percentage of predicted counting predicates is thus almost 40%, which is much higher than the class distribution in the training data (around 11% positive). One reason is the very low precision scores of the classifiers, which may lead to more false positives. The enumerating classifier predicts 12,366 of 36,244 predicates (34%) as enumerating predicates, which is closer to the class distribution seen in the training data (around 40% positive).

KB Enumerating Predicates
DBP-raw college, workInstitution, affiliations, members, voiceActor, nativeLangugae, politicalParty
DBP-map recordLabel, developer, product, publisher, formerCoach, employer, governor
WD-truthy participantOf, airlineHub, developer, father, sponsor
FB actor, member, starring, publisher, airportsServed, foundedLocation
Wrong Predictions
DBP-raw currentTeam, deathCause, weightClass
DBP-map secondTeam, genre
WD-truthy parentOrganization, hairColor
FB cameras, burstCapability, founder


Table 9: Example predicted enumerating predicates from the different KBs.

We illustrate some predicted counting predicates in Table 8 and a few enumerating predicates in Table 9. The DBpedia raw KB predicate voiceActor, for example, connects a voice actor to the associated shows (list of shows Mel Blanc voiced: https://tinyurl.com/dbpedia-mel-blanc), and employees (sample of subjects with this predicate: https://tinyurl.com/dbpedia-employees) gives the number of employees in an organization.

The classifiers also misclassify, as shown in the previous tables; for example, the counting classifier wrongly predicts dates like birthYear and foundingYear and measurements such as km and height as counting predicates. The enumerating classifier makes errors by positively labelling functional and pseudo-functional predicates like currentTeam and sourceOfIncome.

Class Pre-filter Post-filter Classifier
Enumerating 2,167 151 2,016 (93.0%)
Counting 2,158 891 1,277 (59.2%)
Table 10: The number of identifier predicates present in the input to the classifiers (Pre-filter) vs. the number remaining among the predicted predicates (Post-filter), and the number (and share) of identifier predicates successfully removed by the classifiers (Classifier).

d. Filtering identifier labels

Our classifiers, especially the counting classifier, have lower precision than recall. One commonly occurring type of predicate is identifiers, which may be represented as facts with a large number in integer or string format, and we can remove such predicates without losing any actual set predicate. The filtering is done by checking for the presence of the words 'id' and 'code' as substrings, but not as part of a longer word, in the predicate label, irrespective of the source KB. In Table 10 we compare the number of identifier predicates that need to be filtered before classification versus the number of predicates filtered after classification. The enumerating classifier is good at filtering identifier predicates, since almost 93% of the identifier labels are predicted to be false. The counting classifier removes around 59% of the identifier predicates and could benefit from the identifier filter. Thus we apply the identifier label filter only on the output of the classifiers; the final numbers are shown in the Filtered column of Tables 6 and 7. A sketch of the filter is given below.
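A minimal sketch of this identifier filter; it treats 'id' and 'code' as whole words after splitting the label on camel case and common separators, so that a predicate like president is not filtered:

import re

def is_identifier_predicate(label: str) -> bool:
    """True if the predicate label contains 'id' or 'code' as a whole word."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", label)
    tokens = re.split(r"[\s_./#-]+", spaced.lower())
    return bool({"id", "code"} & set(tokens))

assert is_identifier_predicate("postalCode")
assert is_identifier_predicate("imdb_id")
assert not is_identifier_predicate("president")   # 'id' only as part of a longer word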

Metric Counting (NDCG@1 / @3) Enumerating (NDCG@1 / @3)
Absolute 0.71 0.56 0.62 0.63
Jaccard 0.76 0.61 0.69 0.67
Conditional 0.71 0.56 0.68 0.67
Conditional 0.76 0.68 0.62 0.63
P’wiseMI 0.73 0.58 0.71 0.70
P’fectMR 0.70 0.57 0.73 0.72
Correlation 0.77 0.69 0.62 0.61
P’tileVM 0.72 0.57 0.65 0.65
CosineSim 0.79 0.61 0.74 0.73
Combined 0.84 0.67 0.75 0.75
Table 11: Average NDCG scores for the alignment stage.

e. Statistical alignment

The predicate pair (p_e, p_c), where p_e is a predicted enumerating predicate and p_c is a predicted counting predicate, is a possible alignment. However, co-occurring predicate pairs have a long tail of infrequent pairs. Of all co-occurring pairs we consider only alignments which co-occur for at least 50 subjects.

The NDCG Järvelin and Kekäläinen (2002) scores reported in Table 11 are an evaluation of the top three alignments from all nine alignment metrics based on relevance judgments collected from crowd workers. We report the NDCG at positions 1 and 3. The table is divided into the three alignment families and we consider two directions. The first is the direction from a counting predicate to its enumerating predicate alignments and the second is the reverse.
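For reference, a minimal sketch of NDCG at cut-off k over a ranked list of the graded relevance scores collected above (1 is the highest grade):

import math

def ndcg_at_k(relevances, k):
    """Normalized discounted cumulative gain at rank k."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([0.67, 1.0, 0.33], k=3))   # quality of a top-3 alignment ranking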

Based on the scores presented in Table 11, we can conclude that the linguistic similarity metric CosineSim performs best individually, except for NDCG@3 in the counting-to-enumerating direction, where the Pearson correlation measure performs best. The Correlation metric in the counting-to-enumerating direction and the P'fectMR metric in the reverse direction are the best performing metrics of the set predicate value distribution family. The strongest metrics in the set predicate co-occurrence family are Conditional in the direction of counting-to-enumerating predicate alignment and P'wiseMI in the other direction.

The Combined metric takes the best performing metric from each family and computes the mean of the alignment scores to obtain a combined score which gives better results than any individual metric. We use this combined measure to rank our alignments.
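A minimal sketch of this combination, assuming the individual metric scores for a candidate pair are available in a dictionary and the best metric per family has been fixed beforehand (the default names below are placeholders):

def combined_score(scores: dict, best_per_family=("conditional", "correlation", "cosine_sim")) -> float:
    """Mean of the best-performing metric from each of the three families."""
    return sum(scores[m] for m in best_per_family) / len(best_per_family)

# ranked = sorted(candidates, key=lambda pair: combined_score(pair_scores[pair]), reverse=True)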

8 Use Cases

Question answering

Figure 3: Demo system for question answering.

We are in the process of building a demo that highlights how aligned set predicates can enrich simple single-triple queries. In the interface, users can input a simple Wikidata or DBpedia query by specifying a predicate and a subject or an object. Along with the results for the other field, the interface will then show a ranked list of aligned set predicates, both those having values (bottom 2 in Fig. 3) and those having no values (top 3 in the same figure). This makes the demo relevant for two use cases: (i) KB curation, by checking which related predicates have missing information so far, and (ii) question answering, by enhancing count questions with instance information, and vice versa. A video of the demo prototype can be found at https://tinyurl.com/y2ka4kfu, and the demo can be accessed at https://counqer.mpi-inf.mpg.de.

KB curation

In this section we look into a few alignments from different KBs and the distribution of their values. The first alignment in Fig. 4 is the pair (academicAffiliation, academicStaff) from the DBpedia raw KB, which co-occurs in 57 subjects. The x-axis represents the number of instances while the y-axis is the value of the counting predicate. Each point in the plot represents a subject which is connected to x instances by the enumerating predicate (here academicAffiliation) and takes the value y for the counting predicate (here, academicStaff). In the ideal case the count of instances should match the value and all points should lie along the line y = x. Points lying above this line suggest incompleteness. Such is the case in Fig. 4, where the predicate academicAffiliation does not give a complete list of instances to match the value of academicStaff.

Figure 4: Value distribution of academicStaff and count of academicAffiliation across 57 subjects in the DBpedia raw KB.
Figure 5: Value distribution of counting predicate numberOfEmployees and count of enumerating predicate employer across 278 subjects in DBpedia mapping-based KB.


Next we look into an alignment from the DBpedia mapping-based KB, (employer, numberOfEmployees), in Fig. 5. In this alignment too, the observation is that the enumerated facts are far fewer than the number of employees, typically because such facts exist only for the most important employees.

Figure 6: Value distribution of counting predicate memberCount and count of enumerating predicate memberOfPoliticalParty across 62 subjects in Wikidata-truthy KB.


Figure 7: Value distribution of counting predicate populationState and count of enumerating predicate placeOfBirth across 48 subjects in Freebase KB.


In Fig. 6, we show an alignment from the Wikidata KB concerning the members of a political party, (memberOfPoliticalParty, memberCount). Similar to the previous trends, here too the number of enumerated facts about the members of a political party is lower than the actual value. The final alignment we show is the pair (placeOfBirth, populationState) in the Freebase KB, shown in Fig. 7. From the numbers it seems that the predicate populationState covers small geographical locations, where the enumerated facts are more complete than in the previous cases.

Figure 8: Value distribution of the counting predicate venues and count of enumerating predicate stadium for 2179 sports events (left), and numberOfMembers and count of localCouncil for 35 political assemblies (right) from the DBpedia raw KB.

Figure 8 shows the value distribution of an alignment where each cell at position (x, y) gives the number of subjects which take y as the value of the counting predicate and have x instances of the enumerating predicate. The first analysis concerns the places where a sports event took place. A notable anomaly in the DBpedia raw KB is that regularly, for each venue, both the stadium and the city were recorded. Thus, we plot two green lines showing 1:1 matches and 2:1 matches. Instances above both lines likely point to incompleteness (some stadiums are missing), instances below both lines likely point to errors in the data (i.e., too many stadiums added). As one can see, the completeness appears to be relatively high, while there are several cases that deserve closer inspection w.r.t. possible incorrectness.

The second analysis concerns the number of members of local councils, compared with individual members listed. Here incompleteness is prevalent, with typically only 1 in 30 to 1 in 10 members listed.

9 Discussion

Transferability of method

A crucial aspect of our framework is whether it can be utilized on new KBs without requiring too much adaptation. Our modular framework is aimed towards this purpose. The supervised predicate classification stage allows our approach to be transferred by only creating new training instances. For KBs where textual predicate labels are unavailable, a sensible extension is the incorporation of latent representations of predicates Wang et al. (2014); Lin et al. (2015).

Indirect alignments

The alignments are also helpful in identifying redundancies in the schema, where two or more set predicates (enumerating/counting) describing the same concept exist and are all aligned to a single set predicate of the other variant. For example, the enumerating predicate affiliation in the DBpedia mapping-based KB aligns with the counting predicates {facultySize, staff, numberOfStaff}.

Multi-hop alignments

Counting predicates may well align with multi-hop paths of enumerating predicates. For instance, an interesting near-subset of population(x,y) is worksAt(y,z), basedIn(z,x). The search space for such alignments would grow quadratically, but clever pruning may keep it manageable.

Crowd annotation costs

Annotating the classifier training data and the alignment evaluation data incurs a fixed cost per task per judgment, and we collect 3 judgments per task. Crowdsourcing thus becomes expensive if the number of tasks is in the order of thousands or greater.

Open information extraction

So far we have only considered the alignment of canonicalized KB predicates. An interesting direction would be to extend this alignment towards open information extraction and open knowledge bases in the style of ReVerb Fader et al. (EMNLP 2011), i.e., to align textual phrases like “X has Y employees” with phrases like “Z works at X”, “Z recently joined X”, etc. Numeric open information extraction traditionally focuses on temporal information Ling and Weld (AAAI 2010) and measures Saha et al. (ACL 2017), though there are also some recent works on counting information extraction Mirza et al. (ACL 2017, ISWC 2018), which one might build upon.

10 Conclusion

In this paper we have introduced the problem of set predicate alignment, and presented CounQER, a methodology for identifying and linking set predicates that combines co-occurrence, correlational and linguistic features. Our next goals are to extend this methodology to multi-hop alignments, and towards open predicate phrases extracted from natural language texts.

References

  • A. Algergawy et al. (2018) Results of the ontology alignment evaluation initiative 2018. In Ontology Matching workshop, Cited by: §2.
  • S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives (ISWC 2007) DBpedia: a nucleus for a web of open data. Cited by: §1, §7.1.
  • H. Bast and E. Haussmann (CIKM 2015) More accurate question answering on Freebase. Cited by: §2.
  • N. Boldyrev, M. Spaniol, and G. Weikum (2018) Multi-cultural interlinking of web taxonomies with ACROSS. Journal of Web Science. Cited by: §1, §2.
  • K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, Cited by: §1, §7.1.
  • D. Calvanese, B. Cogrel, S. Komla-Ebri, R. Kontchakov, D. Lanti, M. Rezk, M. Rodriguez-Muro, and G. Xiao (2017) Ontop: answering sparql queries over relational databases. Semantic Web Journal. Cited by: item (ii).
  • D. Calvanese, T. Eiter, and M. Ortiz (AAAI 2009) Regular path queries in expressive description logics with nominals. Cited by: §2.
  • D. Calvanese, M. Lenzerini, and D. Nardi (1998) Description logics for conceptual data modeling. In Logics for databases and information systems, Cited by: §2.
  • J. Euzenat and P. Shvaiko (2007) Ontology matching. Springer. Cited by: §1, §2.
  • A. Fader, S. Soderland, and O. Etzioni (EMNLP 2011) Identifying relations for open information extraction. Cited by: §9.
  • W. Fan, Y. Wu, and J. Xu (SIGMOD 2016) Adding counting quantifiers to graph patterns. Cited by: §2.
  • A. Gelman, A. Jakulin, M. G. Pittau, Y. Su, et al. (2008) A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics. Cited by: §7.4.
  • B. Glimm, C. Lutz, I. Horrocks, and U. Sattler (2008) Conjunctive query answering for the description logic SHIQ. JAIR. Cited by: §2.
  • B. Hollunder and F. Baader (1991) Qualifying number restrictions in concept languages.. KR. Cited by: §2.
  • P. Jain, P. Hitzler, A. P. Sheth, K. Verma, and P. Z. Yeh (ISWC 2010) Ontology alignment for linked open data. Cited by: §1, §2.
  • K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of IR techniques. TOIS. Cited by: §7.5.
  • M. Koutraki, N. Preda, and D. Vodislav (2017) Online relation alignment for linked datasets. In ESWC, Cited by: §2.
  • J. Lehmann, R. Isele, et al. (2015) DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal. Cited by: §7.1.
  • Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu (2015) Learning entity and relation embeddings for knowledge graph completion. In AAAI, Cited by: §9.
  • X. Ling and D. S. Weld (AAAI 2010) Temporal information extraction. Cited by: §2, §9.
  • D. L. McGuinness, F. Van Harmelen, et al. (2004) OWL web ontology language overview. W3C recommendation. Cited by: §2.
  • P. Mirza, S. Razniewski, F. Darari, and G. Weikum (ACL 2017) Cardinal virtues: Extracting relation cardinalities from text. Cited by: §2, §9.
  • P. Mirza, S. Razniewski, F. Darari, and G. Weikum (ISWC 2018) Enriching knowledge bases with counting quantifiers. Cited by: item (i), §2, §2, §9.
  • S. Neumaier, J. Umbrich, J. X. Parreira, and A. Polleres (ISWC 2016) Multi-level semantic labelling of numerical values. Cited by: §2.
  • M. Niepert, C. Meilicke, and H. Stuckenschmidt (AAAI 2010) A probabilistic-logical framework for ontology matching. Cited by: §1, §2.
  • J. Pennington, R. Socher, and C. D. Manning (EMNLP 2014) Glove: global vectors for word representation. Cited by: item 8.
  • E. Rahm and P. A. Bernstein (2001) A survey of approaches to automatic schema matching. VLDB Journal. Cited by: §1, §2.
  • S. Razniewski, F. Suchanek, and W. Nutt (2016) But what do we actually know?. In AKBC Workshop, Cited by: §2.
  • S. Saha, H. Pal, et al. (ACL 2017) Bootstrapping for numerical Open IE. Cited by: §2, §9.
  • P. Shvaiko, J. Euzenat, E. Jiménez-Ruiz, M. Cheatham, and O. Hassanzadeh (2018) Ontology matching workshop. CEUR-WS. Cited by: §2.
  • P. Shvaiko and J. Euzenat (2013) Ontology matching: state of the art and future challenges. TKDE. Cited by: §1, §2.
  • J. Subercaze (ESWC 2017) Chaudron: extending DBpedia with measurement. Cited by: §2.
  • F. M. Suchanek, S. Abiteboul, and P. Senellart (2011) PARIS: probabilistic alignment of relations, instances, and schema. VLDB. Cited by: §1, §2, §3.
  • F. M. Suchanek, G. Kasneci, and G. Weikum (WWW 2007) YAGO: a core of semantic knowledge. Cited by: §1, §7.1.
  • R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). Cited by: §7.4.
  • D. Vrandečić (WWW 2012) Wikidata: a new platform for collaborative data collection. Cited by: §1, §7.1.
  • Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI, Cited by: §9.
  • Z. Wang, J. Li, Y. Zhao, R. Setchi, and J. Tang (2013) A unified approach to matching semantic data on the web. Knowledge-Based Systems. Cited by: §1, §2.
  • D. Wienand and H. Paulheim (ESWC 2014) Detecting incorrect numerical data in DBpedia. Cited by: §2, §4.
  • H. Wu, B. Villazon-Terrazas, J. Z. Pan, and J. M. Gomez-Perez (2014) How redundant is it?: an empirical analysis on linked datasets. COLD. Cited by: §4.
  • A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer (2016) Quality assessment for linked data: a survey. Semantic Web Journal. Cited by: §4.