The sameAs Problem: A Survey on Identity Management in the Web of Data

07/24/2019 ∙ by Joe Raad, et al. ∙ Vrije Universiteit Amsterdam

In a decentralised knowledge representation system such as the Web of Data, it is common and indeed desirable for different knowledge graphs to overlap. Whenever multiple names are used to denote the same thing, owl:sameAs statements are needed in order to link the data and foster reuse. Whilst the deductive value of such identity statements can be extremely useful in enhancing various knowledge-based systems, incorrect use of identity can have wide-ranging effects in a global knowledge space like the Web of Data. With several works having already shown that identity in the Web is broken, this survey investigates the current state of this "sameAs problem". An open discussion highlights the main weaknesses of the solutions in the literature, and outlines open challenges to be faced in the future.


1 Introduction

In an era where the field of Artificial Intelligence (AI) is strongly dominated by Machine Learning, it is sometimes forgotten that the past decade has also seen a major breakthrough in Knowledge Representation (KR). Through the combination of web technologies and a judicious choice of formal expressivity (description logics which correspond to a decidable 2-variable fragment of first order logic), it has become possible to construct and reason over knowledge graphs of sizes that were not imaginable only a few years ago. Nowadays, knowledge graphs of hundreds of millions of statements are routinely deployed by researchers from various fields (e.g. computer science, medicine, humanities), and by companies worldwide (e.g. Google, Bing, Facebook). Since these knowledge graphs are mostly developed independently of one another, it is important that different organisations adhere to common principles and standards for encoding and publishing their knowledge. The most widely adopted set of principles was laid out by Tim Berners-Lee in 2010, and is known as the Linked Open Data (LOD) principles (https://www.w3.org/DesignIssues/LinkedData.html). The idea is that, by following simple best practices for creating structured data, publishers can also enrich, access, and benefit from a larger decentralised knowledge graph, known as the Web of Data.

In such a large and distributed knowledge graph, it is common for the same real-world entity to be described in different knowledge graphs. In the absence of a central naming authority in the Web of Data, it is unavoidable for this same real-world entity to be denoted by different names (IRIs, literals, blank nodes). Hence, to keep these large and geographically distributed knowledge graphs coherent, publishers are encouraged to link their data. Such interlinking is typically established by asserting that two names denote the same real-world entity. For this purpose, the Web Ontology Language (OWL) introduced the owl:sameAs predicate in 2004 (https://www.w3.org/TR/owl-ref#sameAs). For instance, an owl:sameAs statement linking a name from the DBpedia knowledge graph to a name from the Wikidata knowledge graph states that both names refer to the same entity. With its strict logical semantics, this statement indicates that every property asserted for one name can also be inferred for the other, which allows both names to be used interchangeably in all contexts.

While such inferences can be extremely useful in enhancing a number of knowledge-based systems (e.g. providing more coverage and context for search engines, virtual assistants and recommendation systems), incorrect use of identity can have wide-ranging effects in a global knowledge space like the Web of Data. In fact, a number of studies over the years have already shown that identity is misused, estimating the proportion of erroneous owl:sameAs statements in the Web of Data to be between 2.8% [Hogan et al.2012] and 20% [Halpin et al.2010]. In addition, by exploiting the semantics of owl:sameAs and computing the transitive closure of over half a billion statements, [Raad et al.2018] showed the effects of such identity misuse in practice. Specifically, that work shows that whilst in some cases identity misuse results in the false equivalence of semantically close entities (e.g. Barack Obama and the Obama administration), other cases have resulted in the false equivalence of over 177K names referring to a number of different countries, cities and people. With such findings leaving many uncertainties over the quality and usability of the Web of Data in its current state, a proper approach towards the handling of identity links is required in order to make the Web of Data succeed as an integrated knowledge space.

This survey provides the first overview of existing approaches to this widely recognised identity problem in the Web of Data, known as the “sameAs problem” [Halpin et al.2010]. It describes these different solutions, discusses their strengths and limitations, and formulates open challenges. This survey does not cover related but distinct research topics such as entity resolution [Ferrara et al.2013, Nentwig et al.2017] and ontology alignment [Euzenat and Shvaiko2013], which focus on techniques and frameworks for establishing owl:sameAs links. In addition, this survey does not address the historically significant distinction between locating an electronic document with a URL and denoting an RDF resource with an IRI, known as the problem of ‘Sense and Reference’ [Halpin2010]. The rest of this paper is structured as follows. Section 2 gives an overview of the various aspects of the identity problem. Section 3 presents existing alternative identity relations to owl:sameAs. Section 4 gives an overview of proposed strategies and services for managing identity in the Web. Section 5 covers existing approaches for the detection of erroneous identity links, and Section 6 concludes and formulates open challenges.

2 Identity Overview

Identity is an old and thorny topic. Classically speaking, entities that are identical are considered to share the same properties. With $N$ denoting the set of all names, and $\Psi$ the set of all properties, this ‘Indiscernibility of Identicals’ (1) is attributed to Leibniz, and its converse, the ‘Identity of Indiscernibles’ (2), states that entities that share the same properties are identical. That identity is reflexive, symmetrical and transitive also follows from Leibniz’s Law. For names $a, b \in N$:

$$a = b \;\rightarrow\; \forall \psi \in \Psi \, \big(\psi(a) = \psi(b)\big) \tag{1}$$
$$\forall \psi \in \Psi \, \big(\psi(a) = \psi(b)\big) \;\rightarrow\; a = b \tag{2}$$

This identity relation induces a partitioning of $N$ into a collection of non-empty and mutually disjoint equivalence classes $N_k \subseteq N$. From the premises $a = b$, and $\psi(a) = c$, it follows that $\psi(b) = c$ is also the case. In fact, this deduction is central to the Web of Data as it allows complementary descriptions of the same resource to be maintained locally, yet interchanged globally, merely by interlinking the names that are used in those respective descriptions. However, there are also problems with it and, consequently, criticisms have been levelled against it. These problems are neither new nor specific to the Web of Data, as they are present in all KR systems [Grant and Subrahmanian1995, Nguyen2007]. However, they are especially pressing in the Web of Data due to its unprecedented size, the heterogeneity of its content and users, and the absence of a central naming authority. This section briefly presents some of the well-known issues with this notion of identity.
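The partitioning and the substitution deduction described above can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed systems: the function names `equivalence_classes` and `propagate`, the pair-based input format and the placeholder names in the usage note are assumptions made for illustration, and only properties in subject position are propagated.

```python
from collections import defaultdict


def equivalence_classes(same_as_pairs):
    """Partition names into equivalence classes (union-find over sameAs pairs)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two classes

    classes = defaultdict(set)
    for name in list(parent):
        classes[find(name)].add(name)
    return list(classes.values())


def propagate(triples, classes):
    """Copy each (subject, property, value) triple to every name identical to the subject."""
    members = {name: cls for cls in classes for name in cls}
    inferred = set()
    for s, p, o in triples:
        for alias in members.get(s, {s}):
            inferred.add((alias, p, o))
    return inferred
```

For example, with the hypothetical pairs `[("a", "b"), ("b", "c")]` all three names end up in one class, and a single triple asserted about `"a"` is inferred for `"b"` and `"c"` as well, which is also why one wrong link can spread wrong facts across a whole class.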

2.1 Philosophical Problems

From a philosophical point of view, we present two major issues with this notion of identity. Firstly, identity over time poses problems, since a ship (a reference to the ship of Theseus, or Theseus’s paradox) may still be considered the same ship, even though some, or even all, of its original components (i.e. properties) have been replaced by new ones [Lewis1986]. In addition, identity is context-dependent [Geach1967], allowing two medicines that have the same chemical structure to be considered the same in a medical context, but different in other contexts (e.g. because they are produced by different companies). These issues in the classical identity definition have led to various philosophical theories, such as the distinction between accidental properties (traits that could be taken away from an object without making it a different thing), and essential properties (core elements needed for a thing to be the thing that it is) [Kripke1972].

2.2 Practical Problems

Given that this problematic notion of identity is also standardised as part of the Web Ontology Language, it is natural to encounter these issues in Web applications. In fact, due to the Open World Assumption and the continuous growth of the set of names, identity statements in the Web of Data are even more controversial. Firstly, unless two things are explicitly said to be different (e.g. using owl:differentFrom), the absence of an identity statement between them does not mean that they are not identical. Such explicit difference statements are barely present in the Web of Data: compared to the 558M owl:sameAs statements present in a 2015 crawl of the Web of Data [Fernández et al.2017], only 3.6K owl:differentFrom statements existed in the same dataset at that time. In addition, most owl:sameAs links are generated by heuristic entity resolution techniques, which employ practical strategies that are not guaranteed to be accurate. For instance, the precision of such tools ranged between 67% and 86% in the 2017 and 2018 Ontology Alignment Evaluation Initiative (OAEI) (http://oaei.ontologymatching.org/2018/results/conference/index.html). Finally, studies have shown that modellers have different opinions about whether two objects are the same or not. For instance, in [Halpin et al.2010], three KR experts were asked to judge 250 owl:sameAs links collected from the Web. The evaluation shows high disagreement, with one judge confirming the correctness of only 73 owl:sameAs statements, whilst the two other experts judged up to 132 and 181 links as true. While in some cases this may be due to differences in modelling competence, there is also the problem that two modellers may consider different parts of the same knowledge graph within different contexts.

3 Alternative Identity Links

Given these problems with owl:sameAs, a number of vocabularies and approaches have proposed alternative identity relations. This section presents the most widely deployed alternatives and gives an overview of their usage in Table 1.

3.1 Weak-Identity and Similarity Predicates

SKOS predicates. These were introduced as lighter alternatives to owl:sameAs, with skos:closeMatch indicating that “two concepts are sufficiently similar that they can be used interchangeably in some applications”, and skos:exactMatch indicating “a high degree of confidence that the concepts can be used interchangeably across a wide range of applications”.
wdt:P2888. In Wikidata, the exact match predicate (P2888), declared as equivalent to skos:exactMatch, is deployed for linking concepts.
umbel:isLike. This symmetrical relation was introduced by the UMBEL vocabulary to “assert an associative link between similar individuals who may or may not be identical, but are believed to be so”.
Similarity Ontology. [Halpin et al.2010] introduced eight new predicates, hierarchically represented together with existing RDFS, OWL and SKOS predicates. Each predicate in this ontology is also characterised by reflexivity, transitivity and symmetry. The most specific predicate in this ontology is owl:sameAs, and the most general ones are so:claimsRelated and so:claimsSimilar.

| Property | Unique Triples | Unique Names |
|---|---|---|
| owl:sameAs | 558,943,116 | 179,739,567 |
| skos:exactMatch | 566,137 | 1,087,866 |
| umbel:isLike | 461,054 | 478,474 |
| skos:closeMatch | 371,011 | 647,230 |
| wdt:P2888 | 356,648 | 696,535 |

Table 1: Overview of the usage of alternative identity links, based on a 2015 crawl of the Web of Data, and Wikidata for wdt:P2888.

3.2 Contextual Identity

The standardised semantics of owl:sameAs can be thought of as instigating an implicit context that is characterised by all (possible) properties having the same values for the linked names. Weaker types of identity can be expressed by considering a subset of properties with respect to which two resources can be considered the same. At the moment, the way of encoding contexts on the Web is largely ad hoc, as contexts are often embedded in application programs, or implied by community agreement. The issue of deploying contexts in KR systems has been extensively studied in AI [Guha1991]. In the Web of Data, explicit representation of context has been a topic of discussion since its early days [Bouquet et al.2003], as the variety and volume of the Web pose a different set of challenges from the ones encountered in previous AI systems. This section presents approaches focusing on the specific issue of representing contextual identity in the Web.

In [Beek et al.2016], a context is defined as a subset $\Phi \subseteq \Psi$ of the set of all properties $\Psi$, namely those properties which are necessary and sufficient to determine indiscernibility and hence identity:

$$a =_{\Phi} b \;\rightarrow\; \forall \psi \in \Phi \, \big(\psi(a) = \psi(b)\big) \tag{3}$$
$$\forall \psi \in \Phi \, \big(\psi(a) = \psi(b)\big) \;\rightarrow\; a =_{\Phi} b \tag{4}$$

Looking back to the example in Section 2.1, two medicines with the same chemical structure, but produced by different companies, are identical in the context where the property specifying the medicine’s commercial supplier is discarded (i.e. $\Phi = \Psi \setminus \{\psi_{\mathit{supplier}}\}$). In [Raad et al.2017], this notion of contextual identity is encoded in RDF, and the definition of a context is extended to a sub-graph of the domain ontology called a global context. Specifically, a global context is composed of a subset of the classes and properties of an ontology, and a set of axioms which are limited to constraints on property domains and ranges. These axioms allow the parameterization of the identity criteria with respect to each class of the ontology. For instance, they make it possible to express that two medicines are considered identical if they have the same quantity of elements of a given type, whilst disregarding the quantity of their other elements. The identity relation between two class instances in a global context is based on the notion of graph isomorphism of their descriptions, and an approach is proposed for automatically detecting these global contexts.

With both these approaches unclear about the treatment of properties that do not belong to the identity context, a richer definition of context was proposed by [Idrissou et al.2017]. It defines a context by two sets of properties, $\Phi$ for indiscernibility and $\Pi$ for propagation:

$$a =_{(\Phi,\Pi)} b \;\rightarrow\; \forall \psi \in \Phi \, \big(\psi(a) = \psi(b)\big) \tag{5}$$
$$\forall \psi \in \Phi \, \big(\psi(a) = \psi(b)\big) \;\rightarrow\; a =_{(\Phi,\Pi)} b \tag{6}$$
$$a =_{(\Phi,\Pi)} b \;\rightarrow\; \forall \psi \in \Pi \, \big(\psi(a) = \psi(b)\big) \tag{7}$$

Principles (5) and (6) refer to the same notion of contextual identity defined in [Beek et al.2016], whilst (7) defines the notion of contextualised propagation. Note that unlike $\Phi$, indiscernibility in $\Pi$ does not determine identity. For instance, in a scientific context, two medicines sharing the same chemical structure is enough to consider them identical, and to infer that they share the same purpose. However, two medicines with the same purpose do not necessarily share the same chemical structure. This approach extends a previous approach by [Batchelor et al.2014], mainly in the way of parametrizing the propagation context, and in the way these contextual identity links are encoded in RDF (on the triples level instead of the graphs level).
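The difference between an indiscernibility context and a propagation context can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not the encoding used in any of the cited works: resource descriptions are assumed to be plain dictionaries from property names to sets of values, and the function names and the medicine data in the usage note are hypothetical.

```python
def values(description, prop):
    """Values of `prop` in a description (empty if the property is absent)."""
    return frozenset(description.get(prop, ()))


def identical_in_context(desc_a, desc_b, indiscernibility_props):
    """True if both descriptions agree on every property of the identity context."""
    return all(
        values(desc_a, p) == values(desc_b, p) for p in indiscernibility_props
    )


def propagate_in_context(desc_a, desc_b, indiscernibility_props, propagation_props):
    """Return a copy of desc_b, enriched with desc_a's values for the propagation
    properties, but only when a and b are identical in the given context."""
    merged = {p: set(v) for p, v in desc_b.items()}
    if identical_in_context(desc_a, desc_b, indiscernibility_props):
        for p in propagation_props:
            merged.setdefault(p, set()).update(desc_a.get(p, ()))
    return merged
```

With two hypothetical medicine descriptions that share a `"structure"` value but differ on `"supplier"`, the pair is identical in the context `{"structure"}` and not in `{"structure", "supplier"}`; propagating `{"purpose"}` in the first context copies the purpose while leaving each supplier untouched.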

4 Identity Management Services

Instead of proposing alternative identity relations for limiting the misuse of owl:sameAs, other approaches have proposed services for managing identity in the Web of Data. These services share the common goal of helping users or applications to identify names referring to the same real-world entity, and to distinguish between similar labels referring to different real-world entities. For instance, in order to avoid using a name referring to the river Niger while intending to use one referring to the country Niger, one could benefit from such services by re-using an existing universal identifier that unambiguously refers to a specific real-world entity (e.g. the river Niger). Services of this type take a more centralised view of identity management in the Web of Data, in which each real-world entity is referenced by a single centralised name. On the other hand, one can make use of other types of services that provide centralised access to identity statements that are published in a decentralised way. Such identity observatories allow Web consumers to make an informed decision regarding the quality of the identity statements they encounter. Such services can also play an important role in enabling large-scale identity analysis in the Web [Beek et al.2018], implementing and optimising linked data queries in the presence of co-reference [Schlegel et al.2014], and detecting erroneous identity statements [de Melo2013]. This section gives an overview of existing identity services.

4.1 Centralised Identity Management

In the early days of the Web, it was originally conceived that resource identifiers would fall into two classes: locators (URLs) to identify resources by their locations, and names (URNs) for assigning location-independent, globally unique, and persistent identifiers [Mealling and Daniel1999]. With URNs, each identifier has a defined namespace that is registered with the Internet Assigned Numbers Authority (IANA). For instance, urn:isbn:0451450523 is a URN that identifies the book “The Last Unicorn”, using the registered ISBN namespace. Because of the lack of a well-defined resolution mechanism, and the organisational hurdle of requiring registration with IANA, URNs are hardly used: a 2015 crawl of the Web of Data contains a total of 47K URNs, and only 73 URN namespaces were registered with IANA (https://iana.org/assignments/urn-namespaces) at the time of writing. Since 2005, the use of the terms URN and URL has been deprecated in favour of the term URI, which encompasses both, and the term IRI, which extends the URI character set.

A more recent centrally managed naming service was proposed by [Bouquet et al.2007]. This public entity name service, named Okkam (a variation of Occam’s razor), intends to establish a global digital space for publishing and managing information about entities, with the idea of encouraging people to reuse existing names instead of creating new ones. Every entity is uniquely identified by an unambiguous universal name known as an OKKAM ID, and is matched to a set of existing names (e.g. DBpedia and Wikidata names). In addition, for each OKKAM entity, a set of attributes is collected and stored in the service for the purpose of finding entities and distinguishing them from one another. However, this public service (hosted at http://okkam.org) is no longer maintained, and no information is available on the number of existing entities and links.

4.2 Identity Observatories

In recent years, identity observatories have gained more popularity. These web services, compared in Table 2, allow users to find, for a given name, the list of names that belong to the same equivalence class. Whilst in recent services these equivalence classes are based solely on the transitive closure of owl:sameAs statements, the Consistent Reference Service (hosted at http://sameas.org) [Glaser et al.2009] incorporates a mix of identity and similarity relationships (such as owl:sameAs, umbel:isLike, and the SKOS predicates), harvested from multiple RDF dumps and SPARQL endpoints. On the other hand, the LODsyndesis co-reference service (hosted at http://www.ics.forth.gr/isl/LODsyndesis) is based on the transitive closure of solely owl:sameAs statements harvested from existing data dumps (e.g. datahub.io, subsets of DBpedia and Wikidata). Finally, the recent co-reference service (hosted at http://sameas.cc) introduced by [Beek et al.2018] provides the largest collection of owl:sameAs statements, together with their equivalence closure, collected from a 2015 crawl of the Web of Data.

| | sameas.org | LODsyndesis | sameas.cc |
|---|---|---|---|
| # Names | 203,953,936 | 65,315,931 | 179,739,567 |
| # Statements | 346,425,685 | 44,028,829 | 558,943,116 |
| # owl:sameAs | Unknown | 44,028,829 | 558,943,116 |
| # Partitions | 62,591,808 | 24,076,816 | 48,999,148 |
| # Eq. Classes | Unknown | 24,076,816 | 48,999,148 |

Table 2: Overview of Existing Identity Observatories

5 Erroneous Identity Links Detection

Finally, an important aspect of limiting the “sameAs problem” is the detection of incorrectly asserted identity links. In order to detect such incorrect links, various kinds of information may be exploited: RDF triples related to the linked resources, domain knowledge that is described in the ontology or that is obtained from experts, or different network metrics. This section presents existing approaches, classified into three, possibly overlapping, categories. Table 3 provides an overview of these approaches.

| Approach | Type of Approach | Requirements | Evaluated Data | Results | Transparency |
|---|---|---|---|---|---|
| [CudreMauroux et al.2009] | Inconsistency-based | Source Trustworthiness; Presence of owl:differentFrom | Synthetic graph of 8K entities and 24K links | 75% to 90% accuracy | - |
| [Hogan et al.2012] | Inconsistency-based | Ontology Axioms | 3.77M owl:sameAs from a 2010 crawl of 3.9M Web documents | 85% precision, 40% recall (only 280 inconsistent classes out of 2.8M) | - |
| [Papaleo et al.2014] | Inconsistency-based and Content-based | Ontology Axioms; Ontology Mappings | 344 owl:sameAs produced by 3 different linking tools (OAEI 2010) | 37% to 88% precision, 75% to 100% recall (depending on the dataset) | D |
| [de Melo2013] | Inconsistency-based | UNA | BTC2011: 3.4M owl:sameAs and sameAs.org: 22.4M owl:sameAs | no precision or recall evaluation | D |
| [Valdestilhas et al.2017] | Inconsistency-based | UNA | LinkLion: 19.2M owl:sameAs | no precision or recall evaluation | D, T, R |
| [Paulheim2014] | Content-based (outlier detection) | - | Peel-DBpedia: 2K owl:sameAs; DBTropes-DBpedia: 4.2K owl:sameAs | 58% to 80% AUC; 50% F1-measure | D, T |
| [Cuzzola et al.2015] | Content-based (natural language analysis) | Textual Description for each resource | sameas.org: 411 owl:sameAs (from 7K collected ones before cleansing) | 93% precision, 75% recall | - |
| [Guéret et al.2012] | Network Metrics (local network) | - | SILK framework: 100 owl:sameAs | 49% precision, 68% recall | D, T, R |
| [Raad et al.2018] | Network Metrics (identity network) | - | 558.9M owl:sameAs from a 2015 crawl of the Web of Data | 93% recall, 40% to 73% precision (depending on the eq. class size) | D, T, R |

Table 3: Overview of erroneous identity links detection approaches, stating their type, requirements, the dataset on which the experiments were conducted, and the reported results. Transparency indicates whether the dataset (D), the tool (T), and the results (R) were made available.

5.1 Inconsistency-based Detection Approaches

This category of approaches hypothesises that owl:sameAs assertions leading to logical inconsistencies must be wrong.

5.1.1 Conflicting owl:sameAs and owl:differentFrom

In [CudreMauroux et al.2009], these logical inconsistencies are restricted to conflicting owl:sameAs and owl:differentFrom statements. These conflicts are detected by means of a graph-based constraint satisfaction problem that exploits the symmetry and transitivity of owl:sameAs statements. The detected conflicts are resolved based on the iteratively refined trustworthiness of the sources declaring the statements (i.e. the approach hypothesises that links published by trusted sources are more likely to be correct). The approach shows high accuracy (75% to 90%), although the evaluation was only conducted on synthetic data involving 24K links.
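The kind of conflict targeted here can be made concrete with a small Python sketch. It is deliberately simpler than the constraint satisfaction formulation of [CudreMauroux et al.2009] and omits the trust-based resolution step entirely: it only closes the owl:sameAs pairs under symmetry and transitivity with union-find and reports every owl:differentFrom pair whose two names end up in the same class. The function names and the pair-based input format are assumptions.

```python
def find(parent, x):
    """Return the representative of x's class, registering x on first use."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def conflicting_pairs(same_as, different_from):
    """List the differentFrom pairs contradicted by the closure of sameAs."""
    parent = {}
    for a, b in same_as:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    return [
        (a, b) for a, b in different_from
        if find(parent, a) == find(parent, b)
    ]
```

A reported pair only says that at least one statement on the sameAs path between the two names, or the differentFrom statement itself, is wrong; deciding which one requires extra evidence such as the source trust scores used in the cited work.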

5.1.2 Ontology Axioms Violation

In [Hogan et al.2012], logical inconsistencies are detected after transitive closure, by exploiting ten OWL 2 RL/RDF rules expressing the semantics of axioms such as differentFrom and AsymmetricProperty. When entities causing inconsistencies are detected, they are separated into different seed equivalence classes; the remaining entities are then assigned to one of these seed classes based on their minimum distance in the equivalence class. The approach manages to detect inconsistencies in 280 out of the 2.8M equivalence classes that resulted from the closure of 3.77M owl:sameAs statements. The approach shows high precision (85%) and lower recall (40%). These results also show that consistency does not imply correctness, with 60% of the pairs manually evaluated as being different still belonging to the same (now consistent) equivalence class. In [Papaleo et al.2014], the authors exploit class disjointness, (inverse) functional properties and locally complete properties (multi-valued properties whose information is complete whenever it is present, e.g. the authors of a certain publication) for detecting inconsistencies. The approach first builds a contextual graph of a specified depth describing the two resources involved in an identity link, then applies a unit-resolution inference rule until saturation for detecting inconsistencies within these graphs. The approach was evaluated on three datasets with a total of 344 owl:sameAs links, showing low precision on two of them (37% and 42%) and 88% precision on the third, with a recall between 75% and 100%.

5.1.3 Unique Name Assumption Violation

In [de Melo2013] and [Valdestilhas et al.2017], inconsistencies are detected by presuming that knowledge graphs preserve the Unique Name Assumption (UNA), and that violations of the UNA are indicative of erroneous identity links. The UNA states that two distinct names in the same knowledge graph do not refer to the same real-world entity. Experiments show that both approaches are scalable (tested on 26M and 19M owl:sameAs statements respectively). However, the precision, recall and accuracy of both approaches have not been evaluated. Interestingly, [de Melo2013] claims that most of the UNA violations stem from incorrect identity links, not from inadvertent duplicates. In contrast, in the analysis of a sample of 100 errors, the authors of [Valdestilhas et al.2017] show that 90% of the errors stem from duplicates within the dataset, rather than from names referring to two different real-world entities.
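A UNA check over already computed equivalence classes can be sketched as follows. This is not the algorithm of either cited work: it only flags classes that contain more than one name from the same source, and it approximates "same knowledge graph" by the host part of the IRI, which is an assumption that will not hold for every dataset. The function names and the `example` IRIs in the usage note are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def source_of(iri):
    """Approximate the source knowledge graph by the IRI's host (an assumption)."""
    return urlsplit(iri).netloc


def una_violations(equivalence_classes, get_source=source_of):
    """Report (source, names) for every class holding several names of one source."""
    violations = []
    for cls in equivalence_classes:
        by_source = defaultdict(list)
        for name in cls:
            by_source[get_source(name)].append(name)
        for source, names in by_source.items():
            if len(names) > 1:
                violations.append((source, sorted(names)))
    return violations
```

For a class containing `http://kg-a.example/Niger`, `http://kg-a.example/Niger_River` and `http://kg-b.example/E1`, the two `kg-a.example` names are reported. As the contradictory findings above suggest, such a report does not say whether the cause is a wrong identity link or a genuine duplicate inside the source.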

5.2 Content-based Approaches

This category of approaches exploits the descriptions associated with each name for evaluating the correctness of an identity link. In [Paulheim2014], the author hypothesises that correct identity links follow certain patterns, with ones violating those patterns being probably erroneous. The approach represents each identity link as a feature vector, and tests six different methods for detecting outliers (e.g. one-class support vector machines). The evaluation, conducted on two different datasets (2K and 4K owl:sameAs each), shows a maximum F1-measure of 54%, which varies between the datasets. Finally, the authors in [Cuzzola et al.2015] used DBpedia categories for calculating a similarity score of the textual descriptions associated with (claimed) identical pairs. The approach was tested on 411 owl:sameAs links, with the evaluation suggesting a precision between 86% and 93%, and a recall between 75% and 79%.

5.3 Network-based Approaches

Finally, a last category of approaches uses network metrics for evaluating the quality of owl:sameAs links. Whilst in [Raad et al.2018] the exploited (identity) network solely contains owl:sameAs statements, in [Guéret et al.2012] the (local) network considers all properties and names related to the two names linked by an owl:sameAs. Specifically, the latter approach aims at measuring the impact that a given owl:sameAs has on this local network, using three classic network metrics (clustering coefficient, betweenness centrality, and degree) and two Linked Data-specific ones (description richness and owl:sameAs chains). For instance, for the owl:sameAs chains metric, it hypothesises that a correct owl:sameAs will contribute to closing an open owl:sameAs chain. The evaluation was conducted on a set of 100 links, and shows 49% precision and 68% recall. In [Raad et al.2018], the approach hypothesises that the more densely a group of names is interlinked, the higher the likelihood that those names are identical. The approach first partitions the identity network into different connected components and then detects the community structure in each of these components. Finally, it assigns an error degree to each owl:sameAs based on the density of the community (or communities) to which the two interlinked names belong and on the reciprocity of the link. The evaluation was conducted on the sameas.cc dataset, and shows a precision between 40% and 73% and a recall of 93%.
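The two signals mentioned for the identity network, community density and link reciprocity, can be computed as in the Python sketch below. It only illustrates these two ingredients under simplifying assumptions (communities are given as input, links are directed pairs, density is measured on the undirected graph); it does not reproduce the community detection step or the error degree formula of [Raad et al.2018], and the function names are hypothetical.

```python
def community_density(names, links):
    """Share of possible undirected edges among `names` that are actually present."""
    names = set(names)
    n = len(names)
    if n < 2:
        return 1.0  # a single name is treated as trivially dense
    edges = {
        frozenset((a, b))
        for a, b in links
        if a in names and b in names and a != b
    }
    return len(edges) / (n * (n - 1) / 2)


def is_reciprocal(link, links):
    """True if the owl:sameAs link is asserted in both directions."""
    a, b = link
    return (b, a) in set(links)
```

Under the hypothesis described above, a non-reciprocal link inside a sparsely connected community would be a stronger candidate for manual inspection than a reciprocal link inside a dense one.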

6 Conclusion & Discussion

This survey has presented the first overview in the ongoing process of limiting the excessive and incorrect use of identity links in the Web of Data. We now present the current situation, and set out directions for future work.
Existing identity links lack semantics. In Section 3.1, several alternative identity predicates were presented. A big downside of these alternatives is their lack of formal semantics. For instance, for skos:exactMatch, whether a degree of confidence is high (enough) is subjective, and the meaning of this relation even changes over time, because information keeps evolving. Also, some proposed alternative properties do not denote equivalence relations, which means that they are of limited use in reasoning and linking. Another downside of these approaches is that they require data publishers to change their modelling practice. A lot of momentum is needed in order to create new knowledge graphs, or to change existing ones in order to make use of these alternative properties. As a result, most of these proposals lack uptake and are only used in a handful of datasets (see Table 1).
Contextual identity requires further investigation. In Section 3.2, different proposals for context-dependent semantics of identity were presented. These approaches have the benefit that they do not require existing modelling practices to be changed since the same property (i.e., owl:sameAs) can be used. An exception to this are approaches that require contexts to be modelled by hand. However, contextual semantics has not yet been widely implemented in Linked Data tools, e.g., reasoners, linked data browsers, and faces potential impediments for uptake. In fact, the exact impact of contextual identity on entailment has not been sufficiently investigated. Finally, the use of identity assertions for the purpose of interlinking may be somewhat hampered by contextual semantics approaches. With the traditional semantics of owl:sameAs, linked descriptions can always be shared, but with contextual semantics such descriptions can only be shared if they are asserted in compatible contexts.
Centralised naming authorities will be of limited use. Centralised naming authorities, presented in Section 4.1, play an important role in facilitating the understanding and re-use of names. However, although they might see limited uptake within some dedicated domains, centralised identity management becomes more difficult and error prone when operating at a larger scale. In addition, the idea of having to go through an authority in order to use a new name somewhat goes against the philosophy of the ad hoc nature of the Web, where “anybody is able to say anything about anything”.
Identity Observatories must be used more broadly. Even though several identity observatories exist (Section 4.2), they are not commonly used in Web applications today. This is probably due to the following limitations which these services suffer from, in their current status and architecture.

- Semantic Interpretability. The ‘equivalence classes’ in sameas.org are the result of the transitive closure of a mix of identity relations with different semantics. Since this service does not keep the original predicates, the semantics of the calculated closure is unclear (e.g. it cannot be used by a DL reasoner for inferring new facts).

- Coverage. With the number of statements in LODsyndesis being an order of magnitude smaller than other observatories, this service may see limited use in certain applications.

- Up-to-date support. With sameas.cc being based on a 2015 crawl of the Web, such service may see limited uptake in applications which require more recent information.

We believe that such services will see uptake over time, since they make it possible to use some of the benefits of linking to other knowledge graphs, while at the same time giving the client some control as to which knowledge graphs to link to (and which ones not to link to).
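The Semantic Interpretability limitation listed above can be made concrete with a small Python sketch. It computes the transitive closure of a set of links with a union-find structure while recording, for each resulting class, which predicates contributed to it. The function name `closure_with_predicates`, the example names, and the choice of `skos:closeMatch` as the weaker predicate are assumptions for illustration; this is not how any of the cited observatories is implemented.

```python
from collections import defaultdict


def closure_with_predicates(links):
    """Group names into classes by transitive closure over all links.

    links: iterable of (subject, predicate, object) triples.
    Returns a list of (members, predicates) pairs, one per class.
    """
    links = list(links)
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, _p, o in links:
        root_s, root_o = find(s), find(o)
        if root_s != root_o:
            parent[root_s] = root_o

    members = defaultdict(set)
    predicates = defaultdict(set)
    for s, p, o in links:
        root = find(s)
        members[root].update((s, o))
        predicates[root].add(p)
    return [(frozenset(members[r]), frozenset(predicates[r])) for r in members]


# Hypothetical links mixing strict identity with a weaker relation.
links = [
    ("ex:a", "owl:sameAs", "ex:b"),
    ("ex:b", "skos:closeMatch", "ex:c"),
    ("ex:d", "owl:sameAs", "ex:e"),
]
classes = closure_with_predicates(links)
strict = [m for m, preds in classes if preds == {"owl:sameAs"}]
```

Here only the class containing `ex:d` and `ex:e` was derived purely from owl:sameAs. If the predicates were discarded, a client could not tell that the class containing `ex:a`, `ex:b` and `ex:c` rests partly on a weaker relation.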
Hybrid error detection approaches are required. Finally, it has now been broadly acknowledged that erroneous identity statements are present in the Web of Data, and that additional effort is needed in order to detect them. In Section 5, we have seen that there are several promising approaches for the (semi-)automatic detection of erroneous identity links. However, all existing approaches make some trade-off, sacrificing precision, recall, or scalability. Specifically, experiments in [Hogan et al.2012] showed that the Web of Data lacks ontological axioms and assertions that are strong enough for deriving inconsistencies, suggesting that axiom violation-based approaches will mainly have a lower recall. Experiments based on UNA violations have shown contradictory results, leaving many uncertainties about the effectiveness of the UNA for the task of detecting erroneous links. Existing content-based approaches have shown promising results, but still require further investigation in terms of their scalability, and of whether sufficient textual descriptions are indeed available in the Web of Data. Finally, network-based approaches have also shown promising results in terms of recall and scalability, but existing experiments showed lower precision.

Future research should focus on combining some of these existing approaches in novel ways, potentially bringing the strengths of these various approaches together into one (hybrid) approach. Such an approach should be feasible over the whole Web, where scalability is not the only challenge, and where certain assumptions about the constantly changing data cannot be made. For instance, in the Web not all names have textual descriptions, and many knowledge graphs do not include vocabulary mappings or lack semantically rich assertions for deriving inconsistencies. In addition, future research should focus on providing more transparency, allowing other approaches to compare, and hopefully improve, their results. Table 3 shows that only three approaches provide fully reproducible results. Finally, compared to the amount of research invested in entity linking [Shen et al.2015] and ontology matching [Ferrara et al.2013], this area is clearly lacking uptake. While in some cases this may be due to various technical challenges (e.g., resulting from the absence of a manually annotated benchmark designed for this task), there is also the aspect that the number and actual effects of these erroneous statements in practice were unknown until recently [Raad et al.2018].
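One conceivable shape for such a hybrid approach is sketched below in Python, purely as an illustration. It assumes that each family of detectors (axiom-based, content-based, network-based) returns an error score in [0, 1] for a link, or nothing when it lacks the evidence it needs, for example when a name has no textual description. The function name `hybrid_error_score`, the signal names, and the weights are hypothetical and are not taken from any surveyed system.

```python
from typing import Mapping, Optional


def hybrid_error_score(
    signals: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted average of the error scores that are actually available.

    signals: detector name -> score in [0, 1], or None if the detector
             could not be applied to this link.
    weights: detector name -> non-negative weight (assumed, hand-chosen).
    Returns None when no detector produced a score.
    """
    available = {k: v for k, v in signals.items() if v is not None}
    total_weight = sum(weights[k] for k in available)
    if total_weight == 0:
        return None
    return sum(weights[k] * v for k, v in available.items()) / total_weight


# Hypothetical link for which no axioms apply, so only two signals exist.
weights = {"axiom": 2.0, "content": 1.0, "network": 1.0}
score = hybrid_error_score(
    {"axiom": None, "content": 0.25, "network": 0.75}, weights
)
no_evidence = hybrid_error_score(
    {"axiom": None, "content": None, "network": None}, weights
)
```

Renormalising over the available signals means a link is not treated as correct merely because one detector had nothing to say. Whether a weighted average is an appropriate combination at all is an open question that the sketch does not settle.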
With this overview of the current state of the “sameAs problem”, we hope that this survey can lead to the emergence of more efficient approaches and systems for representing contextual identity and investigating its impact at scale, accessing explicit and implicit identity assertions in the Web, and detecting the erroneous ones.

References

  • [Batchelor et al.2014] C. Batchelor, C. Brenninkmeijer, C. Chichester, M. Davies, D. Digles, I. Dunlop, C. Evelo, A. Gaulton, C. Goble, A. Gray, et al. Scientific lenses to support multiple views over linked chemistry data. In ISWC, pages 98–113. Springer, 2014.
  • [Beek et al.2016] W. Beek, S. Schlobach, and F. van Harmelen. A contextualised semantics for owl:sameAs. In ISWC, pages 405–419. Springer, 2016.
  • [Beek et al.2018] W. Beek, J. Raad, J. Wielemaker, and F. van Harmelen. sameAs.cc: The closure of 500M owl:sameAs statements. In ESWC, pages 65–80. Springer, 2018.
  • [Bouquet et al.2003] P. Bouquet, F. Giunchiglia, F. van Harmelen, L. Serafini, and H. Stuckenschmidt. C-OWL: Contextualizing ontologies. In ISWC, pages 164–179. Springer, 2003.
  • [Bouquet et al.2007] P. Bouquet, H. Stoermer, and D. Giacomuzzi. OKKAM: enabling a web of entities. In I3, volume 249 of CEUR Workshop Proceedings, 2007.
  • [CudreMauroux et al.2009] P. Cudré-Mauroux, P. Haghani, M. Jost, K. Aberer, and H. De Meer. idMesh: graph-based disambiguation of linked data. In WWW, pages 591–600. ACM, 2009.
  • [Cuzzola et al.2015] J. Cuzzola, E. Bagheri, and J. Jovanovic. Filtering inaccurate entity co-references on the linked open data. In DEXA, pages 128–143. Springer, 2015.
  • [de Melo2013] G. de Melo. Not quite the same: Identity constraints for the web of linked data. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
  • [Euzenat and Shvaiko2013] J. Euzenat and P. Shvaiko. Ontology Matching, 2nd Edition. Springer, 2013.
  • [Fernández et al.2017] J. Fernández, W. Beek, M. Martínez-Prieto, and M. Arias. LOD-a-lot. In ISWC, pages 75–83. Springer, 2017.
  • [Ferrara et al.2013] A. Ferrara, A. Nikolov, and F. Scharffe. Data linking for the semantic web. Semantic Web: Ontology and Knowledge Base Enabled Tools, Services, and Applications, 169:326, 2013.
  • [Geach1967] P.T. Geach. Identity. Review of Metaphysics, 21:3–12, 1967.
  • [Glaser et al.2009] H. Glaser, A. Jaffri, and I. Millard. Managing co-reference on the semantic web. In WWW Workshop on Linked Data on the Web, 2009.
  • [Grant and Subrahmanian1995] J. Grant and V. S. Subrahmanian. Reasoning in inconsistent knowledge bases. IEEE Trans. Knowl. Data Eng., 7(1):177–189, 1995.
  • [Guéret et al.2012] C. Guéret, P. Groth, C. Stadler, and J. Lehmann. Assessing linked data mappings using network measures. In ESWC, pages 87–102. Springer, 2012.
  • [Guha1991] R. Guha. Contexts: a formalization and some applications, volume 101. Stanford University Stanford, CA, 1991.
  • [Halpin et al.2010] H. Halpin, P. J. Hayes, J. McCusker, D. McGuinness, and H. Thompson. When owl:sameAs isn’t the same: An analysis of identity in Linked Data. In ISWC, pages 305–320. Springer, 2010.
  • [Halpin2010] H. Halpin. Sense and reference on the web (doctoral dissertation). University of Edinburgh, 2010.
  • [Hogan et al.2012] A. Hogan, A. Zimmermann, J. Umbrich, A. Polleres, and S. Decker. Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora. Web Semantics Journal, 10:76–110, 2012.
  • [Idrissou et al.2017] A. Idrissou, R. Hoekstra, F. van Harmelen, A. Khalili, and P. van den Besselaar. Is my:sameAs the same as your:sameAs?: Lenticular lenses for context-specific identity. In K-CAP, page 23. ACM, 2017.
  • [Kripke1972] S. Kripke. Naming and necessity. In Semantics of natural language, pages 253–355. Springer, 1972.
  • [Lewis1986] D. Lewis. On the plurality of worlds. Oxford, 14:43, 1986.
  • [Mealling and Daniel1999] M. Mealling and R. Daniel. URI resolution services necessary for URN resolution (RFC 2483), 1999.
  • [Nentwig et al.2017] M. Nentwig, M. Hartung, A. Ngonga Ngomo, and E. Rahm. A survey of current link discovery frameworks. Semantic Web, 8(3):419–436, 2017.
  • [Nguyen2007] N. Nguyen. Advanced methods for inconsistent knowledge management. Springer Science & Business Media, Secaucus, NJ, USA, 2007.
  • [Papaleo et al.2014] L. Papaleo, N. Pernelle, F. Saïs, and C. Dumont. Logical detection of invalid sameAs statements in RDF data. In EKAW, pages 373–384. Springer, 2014.
  • [Paulheim2014] H. Paulheim. Identifying wrong links between datasets by multi-dimensional outlier detection. In WoDOOM, pages 27–38, 2014.
  • [Raad et al.2017] J. Raad, N. Pernelle, and F. Saïs. Detection of contextual identity links in a knowledge base. In K-CAP, page 8. ACM, 2017.
  • [Raad et al.2018] J. Raad, W. Beek, F. van Harmelen, N. Pernelle, and F. Saïs. Detecting erroneous identity links on the web using network metrics. In ISWC, pages 391–407. Springer, 2018.
  • [Schlegel et al.2014] K. Schlegel, F. Stegmaier, S. Bayerl, M. Granitzer, and H. Kosch. Balloon fusion: SPARQL rewriting based on unified co-reference information. In Data Engineering Workshops, pages 254–259. IEEE, 2014.
  • [Shen et al.2015] W. Shen, J. Wang, and J. Han. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460, 2015.
  • [Valdestilhas et al.2017] A. Valdestilhas, T. Soru, and A. Ngonga Ngomo. CEDAL: time-efficient detection of erroneous links in large-scale link repositories. In ICWI, pages 106–113. ACM, 2017.