The last few years have witnessed knowledge bases (KBs) becoming a valuable asset for many AI applications, such as semantic search, question answering and recommender systems. Some existing KBs, e.g., Freebase, DBpedia, Wikidata and Probase, are very large, containing millions of entities and billions of facts, where a fact is organized as a triple of the form ⟨entity, property, value⟩. However, it has been widely observed that existing KBs are likely to have high recall on the facts of popular entities (e.g., celebrities, famous places and award-winning works), but are overwhelmingly incomplete on less popular (e.g., long-tail) entities (Dong et al., 2014; Yu et al., 2019; Razniewski and Weikum, 2018). For instance, as shown in Figure 1, around 2.1 million entities in Freebase have fewer than 10 facts each, while only 7,655 entities have more than one thousand facts, following the so-called power-law distribution.
Among those long-tail entities, some just lack facts in KBs rather than in the real world. The causes of the incompleteness are manifold. First, the construction of large KBs typically relies on soliciting contributions from human volunteers or distilling knowledge from “cherry-picked” sources like Wikipedia, which tends to limit coverage to frequently-mentioned facts (Dong et al., 2014). Second, some formerly unimportant or unknown entities may rise to fame suddenly, due to the dynamics of this ever-changing world (Hoffart et al., 2014), and current KBs may not be updated in time. As the Web has become people's main source of information nowadays, the goal of this paper is to conjecture which facts about long-tail entities are missing, and to extract and infer true facts from various Web sources. We believe that enriching long-tail entities with uncovered facts from the open Web is vital for building more complete KBs.
State-of-the-art and limitations. As investigated in (Paulheim, 2018), manually curating a fact is far more expensive than automatic creation, by a factor of 15 to 250. Given the vast scale of long-tail entities in KBs and of accessible knowledge on the Web, automation is inevitable. Existing approaches address this problem from various angles (Getman et al., 2018; Paulheim, 2017; Wang et al., 2017); however, we argue that they have the following two limitations:
First, existing approaches only deal with a part of the knowledge enrichment problem, such as recommending properties to entities (Zangerle et al., 2016; Lajus and Suchanek, 2018), predicting missing links between entities (Bordes et al., 2013; Dettmers et al., 2018; Shi and Weninger, 2019; Sun et al., 2019) and verifying the truths of facts (Li et al., 2017). Also, the KB population (KBP) or slot filling approaches usually assume that the target properties are given beforehand and extract values from free text (Chen et al., 2014; Surdeanu and Ji, 2014) or structured Web tables (Ritze et al., 2016; Kruit et al., 2019; Yu et al., 2019). To the best of our knowledge, none of them can accomplish open knowledge enrichment alone.
Second, most approaches lack considerations for long-tail entities. Due to the lack of facts about long-tail entities, the link prediction approaches may not learn good embeddings for them. Similarly, the KBP approaches would be error-prone for entities that appear only occasionally, as they cannot handle errors or exceptions well. We note that a few works have begun to study the long-tail phenomenon in KBs, but they tackle different problems, e.g., linking long-tail entities to KBs (Esquivel et al., 2017), extracting long-tail relations (Zhang et al., 2019) and verifying facts for long-tail domains (Li et al., 2017).
Our approach and contributions. To address the above limitations, we propose OKELE, a full-fledged approach to enrich long-tail entities from the open Web. OKELE is based on the idea that we can infer the missing knowledge of a long-tail entity by comparing it with similar popular entities. For instance, to find out what a person lacks, we can see what other similar persons have. We argue that this may not be the best solution, but it is sufficiently intuitive and very effective in practice.
Specifically, given a long-tail entity, OKELE aims to search the Web to find a set of true facts about it. To achieve this, we deal with several challenges: First, the candidate properties for a long-tail entity can be vast. We construct an entity-property graph and propose a graph neural network (GNN) based model to predict appropriate properties for it. Second, the values of a long-tail entity are scattered all over the Web. We consider various types of Web sources and design corresponding extraction methods to retrieve them. Third, the extracted facts from different sources may have conflicts. We propose a probabilistic graphical model to infer the true facts, which particularly considers the imbalance between small and large sources. Note that, during the whole process, OKELE makes full use of popular entities to improve the enrichment accuracy.
The main contributions of this paper are summarized as follows:
We propose a full-fledged approach for open knowledge enrichment on long-tail entities. As far as we know, this is the first work attempting to solve this problem.
We propose a novel property prediction model based on GNNs and the graph attention mechanism, which can accurately predict the missing properties of long-tail entities by comparing them with similar popular entities.
We explore various semi-structured, unstructured and structured Web sources. For each type of sources, we develop the corresponding extraction method, and use popular entities to find appropriate sources and refine extraction methods.
We conduct both synthetic and real-world experiments. Our results demonstrate the effectiveness of OKELE, and also show that the property prediction and fact verification models significantly outperform competitors.
2. Overview of the Approach
In this paper, we deal with RDF KBs such as Freebase, DBpedia and Wikidata. A KB is defined as a 5-tuple $(E, P, C, L, F)$, where $E, P, C, L, F$ denote the sets of entities, properties, classes, literals and facts, respectively. A fact is a triple of the form ⟨entity, property, value⟩. Moreover, properties can be divided into relations and attributes. Usually, it is hard to strictly distinguish long-tail entities from non-long-tail entities. Here, we roughly say that an entity is a long-tail entity if it uses only a small number of distinct properties.
Given a long-tail entity $e$ in a KB, open knowledge enrichment aims at tapping into the masses of data over the Web to infer the missing knowledge (e.g., properties and facts) of $e$ and add it back to the KB. A Web source, denoted by $s$, makes a set of claims, each of which is in the form of $(f, s, o)$, where $f, s, o$ are a fact, a source and an observation, respectively. In practice, $o$ often takes a confidence value reflecting how confidently $f$ is present in $s$ according to some extraction method. Also, each fact $f$ has a label $l_f \in \{\text{True}, \text{False}\}$.
Figure 2 shows the workflow of our approach, which accepts a long-tail entity as input and conducts the following three steps:
Property prediction. Based on the observations that similar entities are likely to share overlapped properties and popular entities have more properties, we resort to similar popular entities. Our approach first creates an entity-property graph to model the interactions between entities and properties. Then, it employs an attention-based GNN model to predict the properties of the long-tail entity.
Value extraction. For each predicted property of the long-tail entity, our approach extracts the corresponding values from the Web. To expand the coverage and make full use of the redundancy of Web data, we leverage various types of Web sources, including semi-structured vertical websites, unstructured plain text in Web content and structured HTML tables. Prior knowledge from popular entities is used to improve the template-based extraction from vertical websites and the distantly-supervised extraction from text.
Fact verification. Our approach employs an efficient probabilistic graphical model to estimate the probability of each fact being true, based on the observations from various Web sources. To tackle the skewed number of claims caused by the long-tail phenomenon, our approach adopts an effective estimator to deal with the effect of source claim size. Moreover, prior knowledge from popular entities is leveraged to guide the estimation of source reliability. Finally, the verified facts are added into the KB to enrich the long-tail entity.
3. Property Prediction
To leverage similar popular entities to infer the missing properties of long-tail entities, the main difficulties lie in how to model the interactions between entities and properties and make an accurate prediction from a large number of candidates. In this section, we present a new property prediction model based on GNNs (Henaff et al., 2015; Kipf and Welling, 2017) and graph attention mechanism (Veličković et al., 2018).
3.1. Entity-Property Graph Construction
We build entity-property graphs to model the interactions between entities and properties and the interactions between similar entities. An entity-property graph consists of two types of nodes, namely entity nodes and property nodes, and two types of edges: (i) entity-property edges are used to model the interactions between entities and properties. We create an edge between an entity node and a property node if the entity uses the property; and (ii) entity-entity edges are used to model the interactions between similar entities. We create an edge between an entity node and each of its top-$k$ similar entity nodes.
Given a KB, for any two entities $e_1, e_2 \in E$, we consider three aspects of similarity between them. Note that, in practice, we sample a small subset of entities for model training.
Type-based similarity considers the number of types (i.e., classes) that $e_1$ and $e_2$ have in common. To emphasize that some types are more informative than others (e.g., actor versus person), we further employ a weighted scoring function, where $T(e)$ denotes the set of types that $e$ directly defines (i.e., without subclass reasoning).
Property-based similarity measures the similarity based on the number of common properties used by $e_1$ and $e_2$, i.e., the overlap of their property sets.
Value-based similarity calculates the number of values that $e_1$ and $e_2$ both have. Analogous to the type-based similarity, we emphasize more informative values (Gunaratna et al., 2015), where $V(e)$ denotes the set of values that $e$ has. For entities and classes, we directly use the URLs, and for literals, we use the lexical forms.
The overall similarity between $e_1$ and $e_2$ is obtained by linearly combining the above three similarities:
$$\mathrm{sim}(e_1, e_2) = \alpha\,\mathrm{sim}_t(e_1, e_2) + \beta\,\mathrm{sim}_p(e_1, e_2) + \gamma\,\mathrm{sim}_v(e_1, e_2),$$
where $\alpha, \beta, \gamma$ are weighting factors s.t. $\alpha + \beta + \gamma = 1$.
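In code, the three measures and their linear combination could look like the following stdlib-only sketch. The exact weighting function for informative types and values is not reproduced above, so an IDF-style inverse-frequency weight is assumed here; the function names and default weights are illustrative only.

```python
import math

def type_sim(types1, types2, type_freq, total):
    # rarer types are more informative: weight each type by log(total / freq)
    common = types1 & types2
    num = sum(math.log(total / type_freq[t]) for t in common)
    den = sum(math.log(total / type_freq[t]) for t in types1 | types2) or 1.0
    return num / den

def prop_sim(props1, props2):
    # unweighted Jaccard overlap of the property sets
    union = props1 | props2
    return len(props1 & props2) / len(union) if union else 0.0

def value_sim(vals1, vals2):
    # values treated the same way; a weighted variant would mirror type_sim
    union = vals1 | vals2
    return len(vals1 & vals2) / len(union) if union else 0.0

def overall_sim(e1, e2, type_freq, total, alpha=0.4, beta=0.3, gamma=0.3):
    # linear combination with alpha + beta + gamma = 1
    return (alpha * type_sim(e1["types"], e2["types"], type_freq, total)
            + beta * prop_sim(e1["props"], e2["props"])
            + gamma * value_sim(e1["vals"], e2["vals"]))
```

The top-$k$ similar entities of an entity would then simply be the $k$ entities with the highest `overall_sim` scores.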
3.2. Attention-based GNN Model
Figure 3 shows our attention-based GNN model for property prediction. Below, we present its main modules in detail. For notations, we use boldface lowercase and uppercase letters to denote vectors and matrices, respectively.
Entity modeling. For each entity $e$, entity modeling learns a corresponding latent vector representation $\mathbf{h}_e$, by aggregating three kinds of interactions, namely entity-property interactions (denoted by $\mathbf{h}_{eP}$), entity-entity interactions ($\mathbf{h}_{eE}$) and the entity itself:
$$\mathbf{h}_e = \sigma\big(\mathbf{W} \cdot [\mathbf{h}_{eP};\, \mathbf{h}_{eE};\, \mathbf{e}] + \mathbf{b}\big),$$
where $\mathbf{e}$ denotes the embedding of $e$, $\sigma$ is the activation function, and $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and bias vector, respectively.
Specifically, entity-property interactions aggregate information from the neighboring property nodes of $e$:
$$\mathbf{h}_{eP} = \sum_{p \in N_P(e)} a_{ep}\,\mathbf{p},$$
where $\mathbf{p}$ denotes the embedding of property $p$, $a_{ep}$ is the interaction coefficient between $e$ and $p$, and $N_P(e)$ is the neighboring property node set of $e$.
Entity-entity interactions aggregate information from the neighboring entity nodes of $e$:
$$\mathbf{h}_{eE} = \sum_{e' \in N_E(e)} b_{ee'}\,\mathbf{e}',$$
where $b_{ee'}$ is the interaction coefficient between $e$ and $e'$, and $N_E(e)$ is the neighboring entity node set of $e$.
There are two ways of calculating the interaction coefficients $a_{ep}$ and $b_{ee'}$. One fixes them to uniform constants under the assumption that all property/entity nodes contribute equally. However, this may not be optimal, because (i) different properties have different importance to one entity, e.g., birth_date is less specific than written_book for an author; and (ii) for an entity, its more similar neighboring entities can better model that entity. Thus, the other way lets the interactions make uneven contributions to the latent vector representations of entities. Here, we employ the graph attention mechanism (Veličković et al., 2018) to make the entity pay more attention to the interactions with more relevant properties/entities, which is formulated as follows:
$$a_{ep} \propto \exp\big(\mathrm{LeakyReLU}(\mathbf{w}^\top [\mathbf{e} \,\|\, \mathbf{p}])\big), \qquad b_{ee'} \propto \exp\big(\mathrm{LeakyReLU}(\mathbf{w}^\top [\mathbf{e} \,\|\, \mathbf{e}'])\big),$$
where $\|$ is the concatenation operation and $\mathbf{w}$ is a weight vector. We normalize the coefficients using the softmax function, so that they can be interpreted as the importance of $p$ and $e'$ to the latent vector representation of $e$.
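A minimal numpy sketch of this attention-based aggregation. The GAT-style scoring (a LeakyReLU over a learned vector applied to the concatenated embeddings) follows Veličković et al. (2018); the weight vector `w`, the 0.2 slope and the function names are assumptions for illustration.

```python
import numpy as np

def attention_coefficients(e_emb, nbr_embs, w):
    # score each neighbor with LeakyReLU(w . [e || n]), then softmax-normalize
    scores = []
    for n_emb in nbr_embs:
        s = w @ np.concatenate([e_emb, n_emb])
        scores.append(s if s > 0 else 0.2 * s)  # LeakyReLU, slope 0.2
    scores = np.array(scores)
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

def aggregate(e_emb, nbr_embs, w):
    # attention-weighted sum of neighbor embeddings (e.g., h_eE or h_eP)
    coef = attention_coefficients(e_emb, nbr_embs, w)
    return sum(c * n for c, n in zip(coef, nbr_embs))
```

The same routine serves both the entity-property coefficients $a_{ep}$ and the entity-entity coefficients $b_{ee'}$, only with different neighbor sets.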
Property modeling. For each property $p$, property modeling learns a corresponding latent vector representation $\mathbf{h}_p$, by aggregating two kinds of interactions, namely property-entity interactions ($\mathbf{h}_{pE}$) and the property itself:
$$\mathbf{h}_p = \sigma\big(\mathbf{W}' \cdot [\mathbf{h}_{pE};\, \mathbf{p}] + \mathbf{b}'\big).$$
Specifically, property-entity interactions aggregate information from the neighboring entity nodes of $p$:
$$\mathbf{h}_{pE} = \sum_{e \in N_E(p)} c_{pe}\,\mathbf{e},$$
where $c_{pe}$ is the interaction coefficient between $p$ and $e$, and $N_E(p)$ is the neighboring entity node set of $p$.
Similarly, we use the graph attention mechanism to refine $c_{pe}$:
$$c_{pe} \propto \exp\big(\mathrm{LeakyReLU}(\mathbf{w}^\top [\mathbf{p} \,\|\, \mathbf{e}])\big),$$
which is normalized by softmax as well.
Property prediction. For entity $e$ and property $p$, we first multiply their latent vector representations element-wise as $\mathbf{h}_{ep} = \mathbf{h}_e \odot \mathbf{h}_p$, which is then fed into a multi-layer perceptron (MLP) to infer the probability $\hat{y}_{ep}$ that $e$ has $p$ using the sigmoid function:
$$\hat{y}_{ep} = \mathrm{sigmoid}\big(\mathrm{MLP}_L(\mathbf{h}_{ep})\big),$$
where $L$ is the number of hidden layers.
We define the loss function as the cross-entropy over all entity-property pairs in the training set:
$$\mathcal{L} = -\sum_{i=1}^{M}\sum_{j=1}^{N} \Big(y_{ij}\log \hat{y}_{ij} + (1 - y_{ij})\log(1 - \hat{y}_{ij})\Big),$$
where $y_{ij}$ is the true label of entity $e_i$ having property $p_j$, and $M$ and $N$ are the numbers of entities and properties in the training set. To optimize this loss function, we use the Adam optimizer (Kingma and Ba, 2015) and the polynomial decay strategy to adjust the learning rate. The model parameter complexity depends mainly on $d$, the dimension of embeddings and latent vectors, and $a$, the size of attentions.
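The prediction head and the loss can be sketched as follows, assuming an element-wise product of the two latent vectors, ReLU hidden layers and a binary cross-entropy objective; the function names and layer shapes are illustrative.

```python
import numpy as np

def predict(h_e, h_p, layers):
    """layers: list of (W, b) pairs; ReLU on hidden layers, sigmoid on the last."""
    x = h_e * h_p  # element-wise product of the two latent vectors
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        x = np.maximum(x, 0.0) if i < len(layers) - 1 else 1.0 / (1.0 + np.exp(-x))
    return x

def bce_loss(y_true, y_pred, eps=1e-12):
    # binary cross-entropy over all entity-property pairs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

In training, negative entity-property pairs would be sampled to complement the observed positive pairs, as described in the experimental settings.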
Example 3.1.
Let us see the example depicted in Figure 4. Erika_Halacher is an American voice actress. She is a long-tail entity in Freebase with two properties film.dubbing_performance.film and person.person.profession. Also, according to the similarity measures, Tress_MacNeille, Frank_Welker and Dee_B._Baker are ranked as its top-3 similar entities. The attention-based GNN model predicts a group of properties for Erika_Halacher, some of which are general, e.g., people.person.nationality, while others are customized, such as film.performance.special_performance_type.
4. Value Extraction
Given an entity and a set of predicted properties, we aim to search the corresponding value collections on the Web, where each value collection is associated with an entity-property pair. However, extracting values for long-tail entities from the Web is hard. On one hand, there are many different types of long-tail entities, and their properties are quite diverse and sparse. On the other hand, a large portion of Web data is un- or semi-structured and scattered, and there is no guarantee of its veracity. Thus, to improve the coverage of value collections and make full use of the redundancy of Web data, we consider semi-structured vertical websites, unstructured plain text and structured data.
4.1. Extraction from Vertical Websites
Vertical websites contain high-quality knowledge for entities of specific types, e.g., IMDB (https://www.imdb.com) is about actors and movies. As found in (Lockard et al., 2018), a vertical website typically consists of a set of entity detail pages generated from a template or a set of templates. Each detail page describes an entity and can be regarded as a DOM tree. Each node in the DOM tree can be reached by an absolute XPath. The detail pages using the same template often share a common structure and placement of content.
Method. We propose a two-stage method to extract values from vertical websites. First, we leverage popular entities in the previous step to find appropriate vertical websites. For a popular entity, we use its name and type(s) as the query keywords to vote and sort vertical websites through a search engine.
Then, we use the known facts of popular entities to learn one or more XPaths for each property. In many cases, the XPaths for the same property are likely to be similar. For example, for property date_of_birth, the XPaths in the Tom_Cruise and Matt_Damon detail pages of IMDB are the same. For the case that a page puts multiple values of the same property together, e.g., in the form of a list or table, we merge the multiple XPaths to a generic one. Also, we use CSS selectors to enhance the extraction by means of HTML tags such as id and class. For example, the CSS selector for date_of_birth is always “#name-born-info > time” in IMDB. Thus, for each vertical website, we can learn a set of XPath-property mappings, which are used as templates to extract values for long-tail entities.
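The template-learning stage can be sketched with toy page representations ({XPath: text} dicts standing in for real DOM trees); the voting scheme and the `min_support` threshold are assumptions of this sketch, not details from the paper.

```python
from collections import Counter

def learn_xpaths(pages, known_facts, min_support=2):
    """pages: list of {xpath: text} for popular-entity detail pages;
    known_facts: parallel list of {property: value} from the KB.
    Returns property -> XPath mappings supported by enough pages."""
    votes = Counter()
    for page, facts in zip(pages, known_facts):
        for xpath, text in page.items():
            for prop, value in facts.items():
                if text == value:  # the KB value appears at this node
                    votes[(prop, xpath)] += 1
    mapping = {}
    for (prop, xpath), n in votes.items():
        if n >= min_support:
            mapping.setdefault(prop, set()).add(xpath)
    return mapping

def extract(page, mapping):
    # apply the learned templates to a long-tail entity's detail page
    return {prop: page[xp] for prop, xps in mapping.items()
            for xp in xps if xp in page}
```

An XPath that consistently locates a property's KB value across several popular-entity pages becomes a template, which is then applied to the pages of long-tail entities.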
Implementation. We employ Google as the search engine, and sample 200 popular entities for each type of entities. According to our experiments, the often-used vertical websites include IMDB, Discogs (https://www.discogs.com), GoodReads (http://www.goodreads.com), DrugBank (https://www.drugbank.ca) and Peakbagger (http://peakbagger.com).
4.2. Extraction from Plain Text
We use closed IE (information extraction) to extract knowledge from unstructured text. Although open IE (Mausam, 2016) can extract facts on any domain without a vocabulary, the extracted textual facts are hard to match with KBs, so we do not consider it currently. Again, we leverage popular entities to improve the extraction accuracy.
Method. We first recognize and link entity mentions in the text (Daiber et al., 2013). Then, we leverage distant supervision (Mintz et al., 2009) for training. Distantly-supervised relation extraction assumes that, if two entities have a relation in a KB, then all sentences that contain these two entities hold this relation. However, this assumption generates some wrongly-labeled training data. To reduce the influence of such errors and noise, we use a relation extraction model with multi-instance learning (Lin et al., 2016), which can dynamically reduce the weights of noisy instances.
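The distant-supervision labeling step itself can be sketched as follows; the sentence tuples and KB triples below are toy stand-ins, and the deliberate noisiness of the assumption is why the multi-instance model is needed afterwards.

```python
def distant_label(sentences, kb_facts):
    """sentences: iterable of (text, head_entity, tail_entity);
    kb_facts: set of (head, relation, tail) triples from the KB.
    Every sentence mentioning a related pair is labeled with that relation.
    This is the distant-supervision assumption, and it is deliberately
    noisy: a sentence may mention both entities without expressing the relation."""
    relations = {(h, t): r for h, r, t in kb_facts}
    labeled = []
    for text, e1, e2 in sentences:
        rel = relations.get((e1, e2))
        if rel is not None:
            labeled.append((text, e1, e2, rel))
    return labeled
```

Multi-instance learning then treats all sentences with the same entity pair as a bag and down-weights the sentences that do not actually express the relation.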
Implementation. To get the text for each entity, we use the snippets of its top-10 Google search results and its Wikipedia page. We leverage the Stanford CoreNLP toolkit (Manning et al., 2014) together with DBpedia Spotlight (Daiber et al., 2013) and n-gram index search for NER. Other NLP jobs are done with the Stanford CoreNLP toolkit. For relation extraction, we use the Wikipedia pages of popular entities as the annotated corpus for distant supervision and implement the sentence-level attention over multiple instances with OpenNRE (Lin et al., 2016).
4.3. Extraction from Structured Data
For structured data on the Web, we mainly consider relational Web tables and Web markup data. Previous studies (Ritze et al., 2016; Kruit et al., 2019) have shown that Web tables contain a vast amount of structured knowledge about entities. Additionally, there are many webpages where their creators have added structured markup data, using the schema.org vocabulary along with the Microdata, RDFa, or JSON-LD formats.
Method. For relational Web tables, the extraction method consists of three phases: (i) table search, which uses the name and type of a target entity to find related tables; (ii) table parsing, which retrieves entity facts from the tables. Following (Li et al., 2017), we distinguish vertical tables and horizontal tables. A vertical table, e.g., a Wikipedia infobox, usually describes a single entity by two columns, where the first column lists the properties while the second column provides the values. We can extract facts row by row. A horizontal table often contains several entities, where each row describes an entity, each column represents a property and each cell gives the corresponding value. We identify which row refers to the target entity using string matching, and extract the table header or the first non-numeric row as properties; and (iii) schema matching, which matches table properties to the ones in a KB. We compare the labels of properties after normalization and also extend labels with synonyms in WordNet.
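The two table-parsing cases can be sketched as follows, with tables as lists of rows; header detection is simplified here to taking the first row, whereas the method above also handles the first non-numeric row.

```python
def parse_vertical(table):
    # vertical table (e.g., an infobox): each row is [property, value]
    # describing a single entity
    return {row[0]: row[1] for row in table if len(row) == 2}

def parse_horizontal(table, target, name_col=0):
    # horizontal table: header row gives the properties, each later row
    # describes one entity; find the target entity by string matching
    header = table[0]
    for row in table[1:]:
        if row[name_col].strip().lower() == target.strip().lower():
            return {p: v for p, v in zip(header[1:], row[1:])}
    return {}
```

Schema matching then maps the extracted table properties (e.g., "Born") to KB properties by normalized label comparison, extended with WordNet synonyms.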
For Web markup data, as the properties from schema.org vocabulary are canonical, and the work in (Tonon et al., 2016) has shown that string comparison between labels of markup entities is an efficient way for linking coreferences, we reuse the entity linking and property matching methods as aforementioned.
Implementation. We collect the English version of WikiTables (Bhagavatula et al., 2013) and build a full-text index for search based on Lucene. We also use the online interface of Google Web tables (https://research.google.com/tables). For Web markup data, we retrieve from the Web Data Commons Microdata corpus (http://webdatacommons.org/structureddata).
5. Fact Verification
Due to the nature of the Web and the imperfection of the extraction methods, conflicts often exist in the values collected from different sources. Among the conflicts, which one(s) represent the truth(s)? Facing the daunting data scale, expecting humans to check all facts is unrealistic, so our goal is to algorithmically verify their veracity.
As the simplest model, majority voting treats the facts claimed by the majority as correct, but it fails to distinguish the reliability of different sources, which may lead to poor performance when the number of low-quality sources is large. A better solution evaluates sources based on the intuition that high-quality sources are likely to provide more reliable facts. However, the reliability is usually unknown a priori. Moreover, in the era of Web 2.0, all end users can create Web content freely, so many sources provide only a few claims about one or two entities, while only a few sources make plenty of claims about many entities, i.e., the so-called long-tail phenomenon (Li et al., 2014). It is very difficult to assess the reliability of those “small” sources accurately, and an inaccurate estimate would impair the effectiveness of fact verification. To tackle these issues, we propose a novel probabilistic graphical model.
5.1. Probabilistic Graphical Model
The plate diagram of our model is illustrated in Figure 5. For each fact $f$, we model its probability of being true as a latent random variable $\mu_f$, and generate it from a beta distribution: $\mu_f \sim \mathrm{Beta}(\alpha_1, \alpha_0)$, with hyperparameters $\alpha_1$ and $\alpha_0$. The beta distribution is a family of continuous probability distributions defined on the interval $[0, 1]$ and often used to describe the distribution of a probability value. $(\alpha_1, \alpha_0)$ determines the prior distribution of how likely $f$ is to be true, where $\alpha_1$ denotes the prior true count of $f$ and $\alpha_0$ denotes the prior false count. The set of all latent truths is denoted by $\boldsymbol{\mu}$. Once $\mu_f$ is calculated, the label of $f$ can be determined by a threshold $\theta$, i.e., $l_f = \text{True}$ if $\mu_f \geq \theta$, and False otherwise.
For each source $s$, we model its error variance $\sigma_s^2$ by the scaled inverse chi-squared distribution: $\sigma_s^2 \sim \text{Scale-inv-}\chi^2(\nu_s, \tau_s^2)$, with hyperparameters $\nu_s$ and $\tau_s^2$ to encode our belief that source $s$ has labeled $\nu_s$ facts with variance $\tau_s^2$. The set of error variances for all sources is denoted by $\boldsymbol{\sigma}^2$. We use the scaled inverse chi-squared distribution for two main reasons. First, it can conveniently handle the effect of sample size to tackle the problem brought by the long-tail phenomenon of source claims (Li et al., 2014). Second, it keeps the scalability of model inference, as it is a conjugate prior for the variance parameter of the normal distribution (Gelman et al., 2013).
For each claim of fact $f$ and source $s$, we assume that the observation $o_{fs}$ is drawn from a normal distribution: $o_{fs} \sim \mathcal{N}(\mu_f, \sigma_s^2)$, with mean $\mu_f$ and variance $\sigma_s^2$. The set of observations is denoted by $\mathbf{o}$. We believe that the observations are likely to be centered around the latent truths and influenced by source quality. Errors, which are the differences between claims and truths, may occur in every source. If a source is unreliable, the observations that it claims would have a wide spectrum and deviate from the latent truths.
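The generative story can be forward-simulated to make the model concrete; the hyperparameter values and source names below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_claims(n_facts, sources, alpha1=2.0, alpha0=2.0):
    """Forward-sample the generative story: a latent truth per fact from
    Beta(alpha1, alpha0), an error variance per source from a scaled
    inverse chi-squared prior, and one noisy observation per claim.
    sources: {name: (nu, tau2)} hyperparameters per source."""
    mu = rng.beta(alpha1, alpha0, size=n_facts)      # latent truths
    sigma2, obs = {}, {}
    for s, (nu, tau2) in sources.items():
        # Scale-inv-chi2(nu, tau2) can be sampled as nu * tau2 / chi2(nu)
        sigma2[s] = nu * tau2 / rng.chisquare(nu)
        # one observation per fact, centered on the truth
        obs[s] = rng.normal(mu, np.sqrt(sigma2[s]))
    return mu, sigma2, obs
```

A "small" source (low $\nu_s$) draws its variance from a wide prior, which is exactly the regime where the sample-size-aware estimator of Section 5.3 matters.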
5.2. Truth Inference
Given the observed claim data, we infer the truths of facts with our probabilistic graphical model. Given hyperparameters $\alpha_1, \alpha_0$ and $\nu_s, \tau_s^2$, the complete likelihood of all observations and latent variables is
$$p(\mathbf{o}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \prod_{f \in F} p(\mu_f \mid \alpha_1, \alpha_0) \cdot \prod_{s \in S} \Big( p(\sigma_s^2 \mid \nu_s, \tau_s^2) \prod_{f \in F_s} p(o_{fs} \mid \mu_f, \sigma_s^2) \Big),$$
where $F$ and $S$ denote the sets of facts and sources, respectively, and $F_s$ is the set of facts claimed in source $s$.
We want to find an assignment of latent truths that maximizes the joint probability, i.e., the maximum-a-posteriori estimate for $\boldsymbol{\mu}$:
$$\hat{\boldsymbol{\mu}} = \arg\max_{\boldsymbol{\mu}} p(\boldsymbol{\mu} \mid \mathbf{o}) = \arg\max_{\boldsymbol{\mu}} p(\boldsymbol{\mu}, \mathbf{o}).$$
Note that this derivation holds as $\mathbf{o}$ is the observed claim data.
Based on the conjugacy of exponential families, in order to find $\hat{\boldsymbol{\mu}}$, we can directly integrate out $\boldsymbol{\sigma}^2$ in Eq. (15). Therefore, the goal becomes to maximize Eq. (17) w.r.t. $\boldsymbol{\mu}$, which is equivalent to minimizing the negative log-likelihood.
Now, we can apply a gradient descent method to optimize this negative log-likelihood and infer the unknown latent truths.
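A simplified version of this inference step can be sketched as plain gradient descent on the negative log-likelihood, with the source variances held fixed rather than integrated out as in the full model; the learning rate, step count and initialization are illustrative.

```python
import numpy as np

def infer_truths(obs, sigma2, alpha1=2.0, alpha0=2.0, lr=0.005, steps=5000):
    """obs: {source: {fact_id: observation}}; sigma2: {source: error variance}.
    Minimizes the negative log-likelihood of the beta prior plus the
    normal claim likelihoods, with source variances treated as known."""
    n = 1 + max(f for claims in obs.values() for f in claims)
    mu = np.full(n, 0.5)
    for _ in range(steps):
        # gradient of the negative log beta prior
        grad = -(alpha1 - 1) / mu + (alpha0 - 1) / (1 - mu)
        # gradient of the negative log normal likelihood, per claim
        for s, claims in obs.items():
            for f, o in claims.items():
                grad[f] += (mu[f] - o) / sigma2[s]
        mu = np.clip(mu - lr * grad, 1e-6, 1 - 1e-6)
    return mu
```

Low-variance (reliable) sources dominate the gradient, so the inferred truths gravitate toward their observations, which is the intended behavior of the model.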
5.3. Hyperparameter Setting and Source Reliability Estimation
In this section, we describe how to set hyperparameters and how to estimate source reliability, using prior truths and observed data.
Intuitively, the error variances of different sources should depend on the quality of the observations in each source, rather than being set as constants regardless of sources. As aforementioned, $\sigma_s^2$ is drawn from $\text{Scale-inv-}\chi^2(\nu_s, \tau_s^2)$. As the sum of squares of standard normal variables follows the chi-squared distribution (Hogg and Craig, 1978), we can relate the hyperparameters to the observed deviations. Therefore, we set $\nu_s = |F_s|$ and $\tau_s^2 = \bar{\sigma}_s^2$, where $\bar{\sigma}_s^2$ is the sample variance calculated as
$$\bar{\sigma}_s^2 = \frac{1}{|F_s|} \sum_{f \in F_s} (o_{fs} - \mu_f)^2.$$
In this way, we encode that $s$ has already provided $|F_s|$ observations with average squared deviation $\bar{\sigma}_s^2$.
Calculating $\bar{\sigma}_s^2$ needs the truths of the facts claimed in $s$. The most common way is to estimate them by majority voting, which lets $\mu_f = \frac{1}{|S_f|} \sum_{s \in S_f} o_{fs}$, where $S_f$ denotes the set of sources that claim $f$. A better way is to exploit prior knowledge (i.e., existing facts in the KB) to guide truth inference. Here, we use prior truths derived from a subset of popular entities to guide source reliability estimation. For each prior truth in the KB, we directly fix $\mu_f = 1$. Besides, we leverage the prior truths to predict whether a property is single-valued or multi-valued by analyzing how popular entities use the property. If the property is single-valued, we only label the fact with the highest probability as correct; otherwise, the correct facts are determined by the threshold $\theta$.
Furthermore, we find that $\bar{\sigma}_s^2$ may not accurately reveal the real variance of a source when $|F_s|$ is very small, as many sources have very few claims in the real world, which further causes imprecise truth inference. To solve this issue, we adopt the estimation proposed in (Li et al., 2014), which uses the upper bound of the confidence interval of the sample variance as the estimator.
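A sketch of such an estimator, using the stdlib-only Wilson-Hilferty approximation to the chi-squared quantile rather than a statistics library; the exact estimator in Li et al. (2014) may differ in detail, and the confidence level is an assumption.

```python
from statistics import NormalDist

def chi2_quantile(p, df):
    # Wilson-Hilferty approximation to the chi-squared quantile function
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

def variance_upper_bound(sq_devs, confidence=0.95):
    """sq_devs: list of squared deviations (o_fs - mu_f)^2 for one source.
    Returns the upper bound of the confidence interval of the sample
    variance, which is conservative for sources with few claims."""
    n = len(sq_devs)
    sample_var = sum(sq_devs) / n
    # since n * s^2 / sigma^2 ~ chi2(n), sigma^2 <= n * s^2 / chi2_{alpha, n}
    return n * sample_var / chi2_quantile(1 - confidence, n)
```

For a source with only a handful of claims the bound is much larger than the raw sample variance, so the source is treated as unreliable until it accumulates evidence; as the number of claims grows, the bound converges to the sample variance.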
After we obtain the inferred truths of facts using the proposed model, the posterior source quality can be calculated by treating the truths as observed data. Thanks to the conjugacy of the prior, the maximum-a-posteriori estimate of source reliability has a closed-form solution:
$$\hat{\sigma}_s^2 = \frac{\nu_s \tau_s^2 + \sum_{f \in F_s} (o_{fs} - \mu_f)^2}{\nu_s + |F_s| + 2}.$$
Example 5.1.
Recall the example in Figure 4. The attention-based GNN model predicts a few properties like people.person.nationality and film.performance.film. For each property, many possible values are found from heterogeneous Web sources (see Figure 6). During fact verification, identical facts from different sources are merged, but each source keeps its own observations. Finally, the nationality of Erika_Halacher (USA) is correctly identified among the conflicting values, and some films that she dubbed are found as well.
Finally, we add the verified facts back to the KB. For a relation fact, we use the same aforementioned entity linking method to link the value to an existing entity in the KB. If the entity is not found, we create a new one with the range of the relation as its type. If this relation has no range, we assign the type from NER to it. For an attribute fact, we simply create a literal for the value.
6. Experiments and Results
We implemented our approach, called OKELE, on a server with two Intel Xeon Gold 5122 CPUs, 256 GB memory and an NVIDIA GeForce RTX 2080 Ti graphics card. The datasets, source code and gold standards are accessible online (https://github.com/nju-websoft/OKELE/).
6.1. Synthetic Experiment
6.1.1. Dataset Preparation
Our aim in this experiment was twofold. First, we wanted to conduct a module-based evaluation of OKELE and compare it with related work in terms of property prediction, value extraction and fact verification. Second, we wanted to evaluate various (hyper-)parameters in OKELE and use the best ones in the real-world experiment. As far as we know, there exists no benchmark dataset for open knowledge enrichment yet.
We used Freebase as our KB, because it is well-known and widely used. It is no longer updated, which makes it more appropriate for our problem, as there is a larger difference against current Web data. We chose 10 classes in terms of popularity (i.e., entity numbers) and familiarity to people. For each class, we created a dataset with 1,000 entities for training, 100 for validation and 100 for test, all randomly sampled without replacement from the top 20% of entities by filtered property numbers (Bast et al., 2014). Table 1 lists the statistics of the sampled data.
We leveraged these entities to simulate long-tail entities, based on the local closed world assumption (Li et al., 2017; Ritze et al., 2016) and the leave-$n$-out strategy (Zangerle et al., 2016). For each entity, we randomly kept five properties (and the related facts) and removed the others. The removed properties and facts were then treated as the ground truths in the experiment. We acknowledge that this evaluation may be influenced by the incompleteness of KBs, but it can be carried out at large scale and without human judgment. Also, since we used popular entities, the incompleteness problem is not severe.
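The leave-out simulation is straightforward to reproduce; the sketch below randomly keeps five properties of an entity and returns the removed ones as the ground truth (function and field names are illustrative).

```python
import random

def simulate_long_tail(entity_facts, keep=5, seed=42):
    """entity_facts: {property: [values]}. Randomly keeps `keep` properties
    (with their facts) as the visible long-tail entity; everything removed
    becomes the ground truth to be re-discovered."""
    rng = random.Random(seed)
    props = list(entity_facts)
    kept = set(rng.sample(props, min(keep, len(props))))
    visible = {p: v for p, v in entity_facts.items() if p in kept}
    ground_truth = {p: v for p, v in entity_facts.items() if p not in kept}
    return visible, ground_truth
```

The enrichment pipeline is then run on the visible part only, and its output is scored against the held-out ground truth.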
Table 1: Statistics of the sampled data (classes, # candidate properties, # properties and # facts per test entity).
6.1.2. Experiment on Property Prediction
Below, we describe the experimental setting and report the results.
Comparative models. We selected three categories of models for comparison: (i) the property mining models designed for KBs, (ii) the traditional models widely-used in recommender systems, and (iii) the deep neural network based recommendation models. For each category, we picked several representatives, which are briefly described as follows. Note that, for all of them, we strictly used the (hyper-)parameters suggested in their papers. For the first category, we chose the following three models:
Popularity-based, which ranks properties under a class based on the number of entities using each property.
Predicate suggester in Wikidata (Zangerle et al., 2016), which ranks and suggests candidate properties based on association rules.
Obligatory attributes (Lajus and Suchanek, 2018), which recognizes the obligatory (i.e., not optional) properties of every class in a KB.
For the second category, we chose widely-used recommendation models, including:
eALS (He et al., 2016), which is a state-of-the-art model for item recommendation, based on matrix factorization.
For the last category, we picked three models all in (He et al., 2017):
Generalized matrix factorization (GMF), which is a full neural treatment of collaborative filtering. It uses a linear kernel to model the latent feature interactions of items and users.
MLP, which adopts a non-linear kernel to model the latent feature interactions of items and users.
NeuMF, which is a very recent matrix factorization model with the neural network architecture. It combines GMF and MLP by concatenating their last hidden layers to model the complex interactions between items and users.
For OKELE, we searched the initial learning rate, the weighting factors, the number $k$ of entity neighbors, the number of hidden layers, the dimension $d$ of embeddings and latent vectors, the size $a$ of attentions, and the number of negative examples per positive example over their respective candidate ranges. The final setting was selected on the validation set and optimized by Adam (Kingma and Ba, 2015).
Evaluation metrics. Following the conventions, we employed precision@$k$, normalized discounted cumulative gain (NDCG) and mean average precision (MAP) as the evaluation metrics.
Results. Table 2 lists the results of the comparative models and OKELE on property prediction. We have three main findings: (i) the models in the second and last categories generally outperform those in the first category, demonstrating the necessity of modeling customized properties for entities rather than just mining generic properties for classes; (ii) GMF obtains better results than the others, including the state-of-the-art model eALS, which shows the power of neural network models in recommendation. Note that MLP underperforms GMF, which is in accord with the conclusion in (He et al., 2017). However, we find that NeuMF, which ensembles GMF and MLP, slightly underperforms GMF, due to the non-convex objective function of NeuMF and the relatively poor performance of MLP; and (iii) OKELE is consistently better than all the comparative models. Compared to eALS and GMF, OKELE integrates an advanced GNN architecture to capture complex interactions between entities and properties. Moreover, it uses the attention mechanism during aggregation to model the different strengths of these interactions.
(In Table 2, the best and second best results are marked in bold and underlined, respectively.)
Ablation study. We also conducted an ablation study to assess the effectiveness of each module in the property prediction model. From Table 3, we can observe that removing any of these modules substantially degrades the performance. In particular, we find that limiting the interactions of entities to the top-k improves the results, as controlling the quantity of interactions filters out noise and concentrates on more informative signals.
| w/o top-k ents. | 0.876 | 0.892 | 0.776 | 0.851 | 0.771 |
| w/o ent. interact. | 0.869 | 0.884 | 0.764 | 0.842 | 0.762 |
6.1.3. Experiment on Value Extraction and Fact Verification
In this test, we gave the correct properties to each test entity and compared the facts from the Web with those in the KB (i.e., Freebase).
Comparative models. We selected the following widely-used models for comparison:
Majority voting, which regards the fact with the maximum number of occurrences among conflicts as truth.
TruthFinder (Yin et al., 2008), which uses Bayesian analysis to iteratively estimate source reliability and identify truths.
PooledInvestment (Pasternack and Roth, 2010), which uniformly distributes a source's trustworthiness among its claimed facts and defines the confidence of a fact as the sum of the reliabilities of its providers.
Latent truth model (LTM) (Zhao et al., 2012), which introduces a graphical model and uses Gibbs sampling to measure source quality and fact truthfulness.
Latent credibility analysis (LCA) (Pasternack and Roth, 2013), which builds a strongly-principled, probabilistic model capturing source credibility with clear semantics.
Confidence-aware truth discovery (CATD) (Li et al., 2014), which detects truths from conflicting data with long-tail phenomenon. It considers the confidence interval of the estimation.
Multi-truth Bayesian model (MBM) (Wang et al., 2015), which presents an integrated Bayesian approach for multi-truth finding.
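To illustrate the common core of these truth discovery models, here is a heavily simplified, TruthFinder-style iteration that alternates between estimating fact confidence and source trust; the sources, claims and update rules are illustrative simplifications, not any cited model's exact algorithm.

```python
# Simplified truth discovery: iterate between source trust and fact confidence.
# Each source maps (entity, property) questions to the value it asserts.
claims = {
    "src1": {("e1", "birthplace"): "Paris", ("e2", "genre"): "jazz"},
    "src2": {("e1", "birthplace"): "Paris", ("e2", "genre"): "rock"},
    "src3": {("e1", "birthplace"): "Lyon"},
}

trust = {s: 0.8 for s in claims}  # uniform prior trust
for _ in range(10):
    # Fact confidence: normalized sum of the trust of supporting sources
    conf = {}
    for s, facts in claims.items():
        for q, v in facts.items():
            conf.setdefault(q, {}).setdefault(v, 0.0)
            conf[q][v] += trust[s]
    for q in conf:
        z = sum(conf[q].values())
        for v in conf[q]:
            conf[q][v] /= z
    # Source trust: average confidence of the facts the source claims
    for s, facts in claims.items():
        trust[s] = sum(conf[q][v] for q, v in facts.items()) / len(facts)

truths = {q: max(vs, key=vs.get) for q, vs in conf.items()}
print(truths[("e1", "birthplace")])  # "Paris": supported by two sources
```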
Again, for all of them, we strictly followed the parameter settings in their papers. For OKELE, we set the priors of all latent truths and chose the remaining parameters following the suggestions in Section 5.3. We used the 95% confidence interval of the sample variance, so the corresponding significance level is 0.05. Besides, the threshold for determining the labels of facts was set to 0.5 for all the models.
Evaluation metrics. Following the conventions, we employed precision, recall and F1-score as the evaluation metrics.
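With the 0.5 decision threshold mentioned above, these metrics reduce to a simple set comparison; the fact triples below are invented.

```python
def prf1(predicted_conf, true_facts, threshold=0.5):
    """Precision/recall/F1 of facts whose confidence reaches the threshold."""
    accepted = {f for f, c in predicted_conf.items() if c >= threshold}
    tp = len(accepted & true_facts)
    precision = tp / len(accepted) if accepted else 0.0
    recall = tp / len(true_facts) if true_facts else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical verified facts with confidences, and the ground truth set
conf = {("e1", "genre", "jazz"): 0.9,
        ("e1", "label", "X"): 0.3,
        ("e2", "genre", "rock"): 0.7}
truth = {("e1", "genre", "jazz"), ("e2", "genre", "pop")}
p, r, f = prf1(conf, truth)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.5 0.5
```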
Results. Table 4 illustrates the results of value extraction and fact verification. We have four major findings: (i) not surprisingly, OKELE (value extraction) attains the lowest precision but the highest recall, as it collects all values from the Web; all the other models conduct fact verification based on these data; (ii) both TruthFinder and LTM achieve lower F1-scores even than majority voting. The reason is that TruthFinder considers the implications between different facts, which introduces more noise, while LTM makes strong assumptions on the prior distributions of latent variables, which fail under the long-tail phenomenon on the Web; (iii) although the precision of MBM is lower than that of many models, its recall is quite high. The reason is that MBM tends to give high confidence to unclaimed values, which not only detects more potential truths but also raises more false positives; and (iv) among all the models, OKELE obtains the best precision and F1-score, followed by CATD, since they both handle the challenges of the long-tail phenomenon by adopting effective estimators based on the confidence interval of source reliability. Furthermore, OKELE incorporates prior truths from popular entities to guide the source reliability estimation and truth inference.
| Model | Precision | Recall | F1-score |
|---|---|---|---|
| OKELE (value extraction) | 0.222 | 0.595 | 0.318 |
| OKELE (fact verification) | 0.459 | 0.485 | 0.459 |
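The confidence-interval idea shared by CATD and OKELE's estimator can be sketched as follows: instead of a source's raw accuracy, use a bound of its 95% confidence interval, which penalizes sources with few claims. The normal approximation and the example counts here are illustrative, not the papers' exact estimators.

```python
import math

def reliability_lower_bound(correct, total, z=1.96):
    """Lower bound of the 95% CI of a source's accuracy (normal approximation).

    A source with few claims gets a wide interval, hence a cautious estimate.
    """
    if total == 0:
        return 0.0
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width)

# Two sources with the same observed accuracy but different amounts of evidence
print(reliability_lower_bound(9, 10))      # small sample -> cautious estimate
print(reliability_lower_bound(900, 1000))  # large sample -> close to 0.9
```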
Figure 7 depicts the proportions and precisions of facts from different source types, where “overlap” denotes the facts from at least two different source types. We can see from Figure 7(a) that the number of facts extracted from vertical websites only is the largest. The proportion of facts from structured data only is quite low, as most facts obtained in structured data are also found in other sources. In addition to measuring the proportions of extracted facts, it is important to assess their quality. According to Figure 7(b), overlap achieves the highest precision. However, most verified facts still come from vertical websites.
Ablation study. We also performed an ablation study to assess the effectiveness of leveraging prior truths from popular entities in the fact verification model. As depicted in Table 5, the results show that incorporating prior truths to guide the estimation of source reliability improves the performance.
| Model | Precision | Recall | F1-score |
|---|---|---|---|
| OKELE (fact verification) | 0.459 | 0.485 | 0.459 |
| w/o prior truths | 0.418 | 0.499 | 0.438 |
6.2. Real-World Experiment
6.2.1. Dataset Preparation and Experimental Setting
To empirically test the performance of OKELE as a whole, we conducted a real-world experiment on real long-tail entities and obtained the truths of enriched facts by human judgment. For each class, we randomly selected 50 long-tail entities that have at least a name; the candidate properties are the same as in the synthetic experiment. Table 6 lists the statistics of these samples. For a long-tail entity, OKELE first predicted 10 properties, and then extracted values and verified facts. We hired 30 graduate students in computer software to judge whether each fact is true or false, and each student was paid 25 USD for participation. No one reported that she could not complete the judgments. Each fact was judged by three students to break ties, and the students judged it through their own research, e.g., by searching the Web. The final ground truths were obtained by voting among the judgments. The level of agreement, measured by Fleiss's kappa (Halpin et al., 2010), is 0.812, showing a sufficient agreement.
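The agreement statistic can be computed as below; the judgment matrix is a toy example, not the study's data.

```python
def fleiss_kappa(ratings):
    """Fleiss's kappa. `ratings` has one row per item; each row lists how many
    raters chose each category and sums to the number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_total = n_items * n_raters
    # Overall proportion of judgments falling into each category
    p_cat = [sum(row[j] for row in ratings) / n_total
             for j in range(len(ratings[0]))]
    # Mean per-item agreement among rater pairs
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    p_e = sum(p * p for p in p_cat)  # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# 4 facts, 3 raters each, categories [true, false]
ratings = [[3, 0], [3, 0], [2, 1], [0, 3]]
print(round(fleiss_kappa(ratings), 3))  # 0.625
```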
As far as we know, there is no existing holistic system that can perform open knowledge enrichment for long-tail entities. So, we evaluated the overall performance of OKELE by comparing it with the combination of the two second-best models, GMF + CATD. As the complete set of facts is unknown, we can only measure precision.
Table 7 (excerpt): per-class results of GMF + CATD; the last column is the average.
| # Verified props. | GMF + CATD | 280 | 134 | 205 | 218 | 170 | 417 | 65 | 183 | 254 | 207 | 4.27 |
| # Verified facts | GMF + CATD | 485 | 153 | 228 | 328 | 375 | 722 | 402 | 275 | 303 | 248 | 7.04 |
| Precision | GMF + CATD | 0.845 | 0.204 | 0.312 | 0.527 | 0.710 | 0.846 | 0.501 | 0.440 | 0.837 | 0.444 | 0.567 |
Table 7 shows the results of the real-world experiment on different classes. First of all, we see that the results differ among classes. The number of verified facts in class film is significantly larger than those in other classes, and film also holds the highest precision. One reason is that there are several high-quality movie portals containing rich knowledge, as people often have great interest in films. In contrast, although people are fond of albums as well, OKELE obtains the lowest precision in this class. We find that, since award-winning albums tend to receive more attention, many popular albums in Freebase have award-related properties, which are further recommended to long-tail albums. However, the majority of long-tail albums have nearly no awards yet. Additionally, albums with very similar names caused disambiguation errors. This also happened in class food: OKELE recommended the biological taxonomy properties of natural edible foods to long-tail artificial foods, which have no such taxonomy. In this sense, OKELE may be misguided by using inappropriate popular entities, especially when an entity has multiple, more specific types.
The last column of Table 7 lists the average numbers of verified properties and facts per entity, as well as the average precision over the 10 classes. Overall, we find that the performance of OKELE is generally good and significantly better than that of GMF + CATD. Additionally, the verified facts come from 3,482 Web sources in total. The average run-time per entity is 326.8 seconds, where 24.5% of the time is spent on network transmission and 40.8% on NER on plain text. For comparison, we ran the same experiment on the synthetic dataset, and OKELE enriched 4.35 properties and 20.59 facts per entity. The average precision, recall and F1-score are 0.479, 0.485 and 0.471, respectively. We attribute the precision difference to the incompleteness of the KB: in the synthetic test, we only consider the facts in the KB as correct, so some correct facts from the Web may be misjudged.
We also conducted a module-based evaluation. For property prediction, we measured top-10 precision w.r.t. the properties in the facts judged as correct by humans. The average precisions of OKELE and GMF are 0.497 and 0.428, respectively. For fact verification, we used the same raw facts extracted by OKELE. The average precisions of OKELE and CATD are 0.624 and 0.605, respectively.
Figure 8 illustrates the proportions and precisions of facts from different source types in the real-world experiment. Similar to Figure 7, vertical websites account for the largest proportion and structured data for the least. Overlap still holds the highest precision. However, the proportion of vertical websites declines while the proportion of plain text increases. This is because, compared with popular entities, few people bother to organize long-tail entities into structured or semi-structured knowledge.
7. Related Work
7.1. KB Enrichment
A wide spectrum of works attempts to enrich KBs from various aspects (Getman et al., 2018; Paulheim, 2017; Wang et al., 2017). According to the data used, we divide the existing approaches into internal and external enrichment.
Internal enrichment approaches focus on completing missing facts in a KB by making use of the KB itself. Specifically, link prediction expects one to predict whether a relation holds between two entities in a KB. Recent studies (Bordes et al., 2013; Dettmers et al., 2018; Sun et al., 2019) have adopted embedding techniques to embed the entities and relations of a KB into a low-dimensional vector space, and predicted missing links by a ranking procedure using the learned vector representations and scoring functions. An exception is ConMask (Shi and Weninger, 2019), which learns embeddings of entity names and parts of text descriptions to connect unseen entities to a KB. However, all these studies only target entity relations and cannot handle attribute values yet. We refer interested readers to the survey (Wang et al., 2017) for more details. Another line of studies is based upon rule learning. Recent notable systems, such as AMIE (Galárraga et al., 2013) and RuleN (Meilicke et al., 2018), have applied inductive logic programming to mine logical rules and used these rules to deduce missing facts in a KB. In summary, the internal enrichment approaches cannot discover new facts outside a KB and may suffer from the limited information of long-tail entities.
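As an illustration of the embedding-based link prediction these works build on, the following sketch uses a TransE-style score (Bordes et al., 2013), where a relation acts as a translation in vector space; the embeddings here are random stand-ins for trained ones, so the resulting ranking is arbitrary.

```python
import math
import random

random.seed(42)
d = 16  # embedding dimension (illustrative)

def rand_vec():
    return [random.gauss(0, 1) for _ in range(d)]

entities = {e: rand_vec() for e in ["paris", "france", "berlin"]}
relations = {"capital_of": rand_vec()}

def transe_score(h, r, t):
    """TransE: a triple (h, r, t) is plausible when h + r is close to t."""
    diff = [hv + rv - tv for hv, rv, tv in
            zip(entities[h], relations[r], entities[t])]
    return -math.sqrt(sum(x * x for x in diff))  # negative L2 distance

# Rank candidate tails for the query (paris, capital_of, ?)
candidates = ["france", "berlin"]
ranked = sorted(candidates,
                key=lambda t: transe_score("paris", "capital_of", t),
                reverse=True)
print(ranked)
```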
External enrichment approaches aim to increase the coverage of a KB with external resources. The TAC KBP task has promoted progress on information extraction from free text (Getman et al., 2018), and many successful systems, e.g., (Chen et al., 2014; Min et al., 2017), use distant supervision together with hand-crafted rules and query expansion (Surdeanu and Ji, 2014). In addition to text, some KB augmentation works utilize HTML tables (Ritze et al., 2016; Kruit et al., 2019) and embedded markup data (Yu et al., 2019) available on the Web. Different from these works, which are tailored to specific sources, the goal of this paper is to diversify our Web sources for value extraction. Besides, Knowledge Vault (KV) (Dong et al., 2014) is a Web-scale probabilistic KB, in which facts are automatically extracted from text documents, DOM trees, HTML tables and human-annotated pages. The main difference between our work and KV is that we aim to enrich long-tail entities in an existing KB, while KV wants to create a new KB and oftentimes concerns popular entities.
7.2. Long-tail Entities
In the past few years, a few works have begun to pay attention to the long-tail phenomenon in KBs. The work in (Reinanda et al., 2016) copes with the entity-centric document filtering problem and proposes an entity-independent method to classify vital and non-vital documents, particularly for long-tail entities. The work in (Esquivel et al., 2017) analyzes the challenges of linking long-tail entities in news corpora to general KBs. The work in (Hoffart et al., 2014) recognizes emerging entities in news and other Web streams, where an emerging entity refers to a known long-tail entity in a KB or a new one to be added into the KB. LONLIES (Farid et al., 2016) leverages a text corpus to discover entities co-mentioned with a long-tail entity, and estimates a property value from the property-value set of the co-mentioned entities. However, LONLIES needs a target property to be given manually and cannot find new property values that do not exist in the property-value set. The work in (Li et al., 2017) tackles the problem of knowledge verification for long-tail verticals (i.e., less popular domains). It collects tail-vertical knowledge by crowdsourcing due to the lack of training data, while our work automatically finds and verifies knowledge for long-tail entities from various Web sources. Besides, the work in (Oulabi and Bizer, 2019) explores the potential of Web tables for extending a KB with new long-tail entities and their descriptions. It develops a different pipeline system with several components, including schema matching, row clustering, entity creation and new detection.
8. Conclusion
In this paper, we introduced OKELE, a full-fledged approach of open knowledge enrichment for long-tail entities. The enrichment process consists of property prediction with an attention-based GNN, value extraction from diversified Web sources, and fact verification with a probabilistic graphical model, all of which incorporate prior knowledge from popular entities. Our experiments on the synthetic and real-world datasets showed the superiority of OKELE over various competitors. In future work, we plan to optimize a few key modules to accelerate the enrichment, and to study a fully neural network-based architecture.
Acknowledgements. This work is supported by the National Natural Science Foundation of China (No. 61872172), the Water Resource Science & Technology Project of Jiangsu Province (No. 2019046), and the Collaborative Innovation Center of Novel Software Technology & Industrialization.
- Easy access to the freebase dataset. In WWW, Seoul, Korea, pp. 95–98. Cited by: §6.1.1.
- Methods for exploring and mining tables on wikipedia. In IDEA, Chicago, IL, USA, pp. 18–26. Cited by: §4.3.
- Translating embeddings for modeling multi-relational data. In NIPS, Lake Tahoe, NV, USA, pp. 2787–2795. Cited by: §1, §7.1.
- Joint inference for knowledge base population. In EMNLP, Doha, Qatar, pp. 1912–1923. Cited by: §1, §7.1.
- Improving efficiency and accuracy in multilingual entity extraction. In I-SEMANTICS, Graz, Austria, pp. 121–124. Cited by: §4.2.
Convolutional 2D knowledge graph embeddings. In AAAI, New Orleans, LA, USA, pp. 1811–1818. Cited by: §1, §7.1.
- Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In KDD, New York, NY, USA, pp. 601–610. Cited by: §1, §1, §7.1.
- On the long-tail entities in news. In ECIR, Aberdeen, UK, pp. 691–697. Cited by: §1, §7.2.
- LONLIES: estimating property values for long tail entities. In SIGIR, Pisa, Italy, pp. 1125–1128. Cited by: §7.2.
- AMIE: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, Rio de Janeiro, Brazil, pp. 413–422. Cited by: §7.1.
- Bayesian data analysis. Chapman and Hall/CRC, Boca Raton, FL, USA. Cited by: §5.1.
- Laying the groundwork for knowledge base population: nine years of linguistic resources for TAC KBP. In LREC, Miyazaki, Japan, pp. 1552–1558. Cited by: §1, §7.1, §7.1.
- FACES: diversity-aware entity summarization using incremental hierarchical conceptual clustering. In AAAI, Austin, TX, USA, pp. 116–122. Cited by: 3rd item.
- Evaluating ad-hoc object retrieval. In IWEST, Shanghai, China, pp. 1–12. Cited by: §6.2.1.
- Neural collaborative filtering. In WWW, Perth, Australia, pp. 173–182. Cited by: §6.1.2, §6.1.2.
- Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, Pisa, Italy, pp. 549–558. Cited by: 2nd item.
- Deep convolutional networks on graph-structured data. CoRR abs/1506.05163, pp. 1–10. Cited by: §3.
- Discovering emerging entities with ambiguous names. In WWW, Seoul, Korea, pp. 385–396. Cited by: §1, §7.2.
- Introduction to mathematical statistics. 4th edition, Macmillan Publishing Co., New York, NY, USA. Cited by: §5.3.
- Adam: a method for stochastic optimization. In ICLR, San Diego, CA, USA, pp. 1–15. Cited by: §3.2, §6.1.2.
- Semi-supervised classification with graph convolutional networks. In ICLR, Toulon, France, pp. 1–14. Cited by: §3.
- Extracting novel facts from tables for knowledge graph completion. In ISWC, Auckland, New Zealand, pp. 364–381. Cited by: §1, §4.3, §7.1.
- Are all people married?: determining obligatory attributes in knowledge bases. In WWW, Lyon, France, pp. 1115–1124. Cited by: §1, 3rd item.
- Knowledge verification for long-tail verticals. Proceedings of the VLDB Endowment 10 (11), pp. 1370–1381. Cited by: §1, §1, §4.3, §6.1.1, §7.2.
- A confidence-aware approach for truth discovery on long-tail data. Proceedings of the VLDB Endowment 8 (4), pp. 425–436. Cited by: §5.1, §5.3, §5, 6th item.
- Truth inference at scale: a bayesian model for adjudicating highly redundant crowd annotations. In WWW, San Francisco, CA, USA, pp. 1028–1038. Cited by: 8th item.
- Neural relation extraction with selective attention over instances. In ACL, Berlin, Germany, pp. 2124–2133. Cited by: §4.2, §4.2.
- CERES: distantly supervised relation extraction from the semi-structured web. Proceedings of the VLDB Endowment 11 (10), pp. 1084–1096. Cited by: §4.1.
- The Stanford CoreNLP natural language processing toolkit. In ACL, Baltimore, MD, USA, pp. 55–60. Cited by: §4.2.
- Open information extraction systems and downstream applications. In IJCAI, New York, NY, USA, pp. 4074–4077. Cited by: §4.2.
- Fine-grained evaluation of rule- and embedding-based systems for knowledge graph completion. In ISWC, Monterey, CA, USA, pp. 3–20. Cited by: §7.1.
- Probabilistic inference for cold start knowledge base population with prior world knowledge. In EACL, Valencia, Spain, pp. 601–612. Cited by: §7.1.
- Distant supervision for relation extraction without labeled data. In ACL, Suntec, Singapore, pp. 1003–1011. Cited by: §4.2.
- Extending cross-domain knowledge bases with long tail entities using web table data. In EDBT, Lisbon, Portugal, pp. 385–396. Cited by: §7.2.
- Knowing what to believe (when you already know something). In COLING, Beijing, China, pp. 877–885. Cited by: 3rd item.
- Latent credibility analysis. In WWW, Rio de Janeiro, Brazil, pp. 1009–1020. Cited by: 5th item.
- Knowledge graph refinement: a survey of approaches and evaluation methods. Semantic Web 8 (3), pp. 489–508. Cited by: §1, §7.1.
- How much is a triple? estimating the cost of knowledge graph creation. In ISWC, Monterey, CA, USA, pp. 1–4. Cited by: §1.
- Entity linking: finding extracted entities in a knowledge base. In Multi-source, Multilingual Information Extraction and Summarization, pp. 93–115. Cited by: §4.2.
- Knowledge base recall: detecting and resolving the unknown unknowns. ACM SIGWEB Newsletter 3, pp. 1–9. Cited by: §1.
- Document filtering for long-tail entities. In CIKM, Indianapolis, IN, USA, pp. 771–780. Cited by: §7.2.
- Profiling the potential of web tables for augmenting cross-domain knowledge bases. In WWW, Montréal, Canada, pp. 251–261. Cited by: §1, §4.3, §6.1.1, §7.1.
- Item-based collaborative filtering recommendation algorithms. In WWW, Hong Kong, China, pp. 285–295. Cited by: 1st item.
- Open-world knowledge graph completion. In AAAI, New Orleans, LA, USA, pp. 1957–1964. Cited by: §1, §7.1.
- RotatE: knowledge graph embedding by relational rotation in complex space. In ICLR, New Orleans, LA, USA, pp. 1–18. Cited by: §1, §7.1.
- Overview of the English slot filling track at the TAC2014 knowledge base population evaluation. In TAC, Gaithersburg, MD, USA, pp. 15. Cited by: §1, §7.1.
- VoldemortKG: mapping schema.org and web entities to linked open data. In ISWC, Kobe, Japan, pp. 220–228. Cited by: §4.3.
- Graph attention networks. In ICLR, Vancouver, Canada, pp. 1–12. Cited by: §3.2, §3.
- Knowledge graph embedding: a survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29 (12), pp. 2724–2743. Cited by: §1, §7.1, §7.1.
- An integrated bayesian approach for effective multi-truth discovery. In CIKM, Melbourne, Australia, pp. 493–502. Cited by: 7th item.
- Truth discovery with multiple conflicting information providers on the web. IEEE Transactions on Knowledge and Data Engineering 20 (6), pp. 796–808. Cited by: 2nd item.
- KnowMore - knowledge base augmentation with structured web markup. Semantic Web 10 (1), pp. 159–180. Cited by: §1, §1, §7.1.
- An empirical evaluation of property recommender systems for wikidata and collaborative knowledge bases. In OpenSym, Berlin, Germany, pp. 1–18. Cited by: §1, 2nd item, §6.1.1.
- Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. In NAACL-HLT, Minneapolis, MN, USA, pp. 3016–3025. Cited by: §1.
- A bayesian approach to discovering truth from conflicting sources for data integration. Proceedings of the VLDB Endowment 5 (6), pp. 550–561. Cited by: 4th item.