Relational match propagation
Knowledge bases (KBs) store rich yet heterogeneous entities and facts. Entity resolution (ER) aims to identify entities in KBs which refer to the same real-world object. Recent studies have shown significant benefits of involving humans in the loop of ER. They often resolve entities with pairwise similarity measures over attribute values and resort to the crowds to label uncertain ones. However, existing methods still suffer from high labor costs and insufficient labeling to some extent. In this paper, we propose a novel approach called crowdsourced collective ER, which leverages the relationships between entities to infer matches jointly rather than independently. Specifically, it iteratively asks human workers to label picked entity pairs and propagates the labeling information to their neighbors in distance. During this process, we address the problems of candidate entity pruning, probabilistic propagation, optimal question selection and error-tolerant truth inference. Our experiments on real-world datasets demonstrate that, compared with state-of-the-art methods, our approach achieves superior accuracy with much less labeling.
Knowledge bases (KBs) store rich yet heterogeneous entities and facts about the real world, where each fact is structured as a triple in the form of (entity, property, value). Entity resolution (ER) aims at identifying entities referring to the same real-world object, which is critical in cleansing and integration of KBs. Existing approaches exploit diversified features of KBs, such as attribute values and entity relationships; see surveys [1, 2, 3, 4]. Recent studies have demonstrated that crowdsourced ER, which recruits human workers to solve micro-tasks (e.g., judging if a pair of entities is a match), can improve the overall accuracy.
Current crowdsourced ER approaches mainly leverage transitivity [5, 6, 7] or monotonicity [8, 9, 10, 11, 12] as their resolution basis. The transitivity-based approaches rely on the observation that the match relation is usually an equivalence relation. The monotonicity-based ones assume that each pair of entities can be represented by a similarity vector of attribute values, and the binary classification function, which judges whether a similarity vector is a match, is monotonic in terms of the partial order among the similarity vectors.
However, both kinds of approaches can hardly infer matches across different types of entities. Let us see Figure 1 for example. The figure shows a directed graph, called entity resolution graph (ER graph), in which each vertex denotes a pair of entities and each edge denotes a relationship between two entity pairs. Assume that a person pair in the figure is labeled as a match; the corresponding birth place pair is then expected to be a match. Since these two pairs are in different equivalence classes, the transitivity-based approaches are unable to take effect. As different relationships (like y:directedBy and y:wasBornIn) make most similarity vectors of entities of different types incomparable, the monotonicity-based approaches have to handle them separately.
In this paper, we propose a new approach called Remp (Relational match propagation) to address the above problems. The main idea is to leverage collective ER that resolves entities connected by relationships jointly and distantly, based on a small amount of labels provided by workers. Specifically, Remp iteratively asks workers to label a few entity pairs and propagates the labeling information to their neighboring entity pairs in distance, which are then resolved jointly rather than independently. There remain two challenges to achieve such a crowdsourced collective ER.
The first challenge is how to conduct an effective relational match propagation. Relationships like functional/inverse functional properties in OWL (e.g., y:wasBornIn) provide strong evidence, but these properties only account for a small portion, while the majority of relationships are multi-valued (e.g., actedIn). Multi-valued relationships often connect non-matches to matches (see Figure 1 for an example). Therefore, we propose a new relational match propagation model to decide which neighbors can be safely inferred as matches.
The second challenge is how to select good questions to ask workers. For an ER graph involving two large KBs, the number of vertices (i.e. candidate questions) can be quadratic. We introduce an entity pair pruning algorithm to narrow the search space of questions. Moreover, different questions have different inference power. In order to maximize the expected number of inferred matches, we propose a question selection algorithm, which chooses possible entity matches scattered in different parts of the ER graph to achieve the largest number of inferred matches.
In summary, the main contributions of this paper are listed as follows:
We design a partial order based entity pruning algorithm, which significantly reduces the size of an ER graph.
We propose a relational match propagation model, which can jointly infer the matches between different types of entities in distance.
We formulate the problem of optimal multiple questions selection with cost constraint, and design an efficient algorithm to obtain approximate solutions.
We present an error-tolerant method to infer truths from imperfect human labeling. Moreover, we train a classifier to handle isolated entity pairs.
We conduct real-world experiments and comparison with state-of-the-art approaches to assess the performance of our approach. The experimental results show that our approach achieves superior accuracy with much fewer labeling tasks.
The transitivity-based approaches [5, 6, 7] make use of prior match probabilities to decide the order of questions. Firmani et al. proved that the optimal strategy is to ask questions in descending order of entity cluster size. They formulated the problem of crowdsourced ER with early termination and put forward several question ordering strategies. Although the transitive relation can infer matches within each cluster, workers need to check all clusters.
The monotonicity-based approaches [8, 9, 10, 11, 12] use the monotonicity property to search new thresholds, and estimate the precision of results. In particular, the partial order based approaches [15, 16, 12] explore similarity thresholds among similarity vectors. Furthermore, POWER groups similarity vectors to reduce the search space. Corleone and Falcon learn random forest classifiers, where each decision tree is equivalent to a similarity vector. However, these approaches are designed for ER with a single entity type. To leverage monotonicity on ER between KBs with complex type information, HIKE uses hierarchical agglomerative clustering to partition entities with similar attributes and relationships, and uses the monotonicity techniques on each entity partition to find matches. Although our approach also uses monotonicity, it does so only to prune candidate entity pairs. In addition, our approach allows match inference between different entity types (e.g., from persons to locations) to reduce the labeling efforts.
Question interfaces. Pairwise and multi-item are two common question interfaces. The pairwise interface asks workers to judge whether a pair of entities is a match [7, 17]. In contrast, Marcus et al. proposed a multi-item interface to save questions, where each question contains multiple entities to be grouped. Wang et al. minimized the number of multi-item questions on the given entity pair set such that each question contains at most a fixed number of entities. Waldo is a recent hybrid interface, which optimizes the trade-off between cost and accuracy of the two question interfaces based on task difficulty. The above approaches do not have inference power and may generate a large number of questions.
Quality control. To deal with errors produced by workers, quality control techniques [6, 20, 21] leverage the correlation between matches and workers to find inaccurate labels, and improve the accuracy by asking more questions about uncertain ones. These approaches gain improvement by redundant labeling.
In addition to attribute values, collective ER [22, 23, 24, 25] further takes the relationships between entities into account. CMD extends the probabilistic soft logic to learn rules for ontology matching. LMT learns soft logic rules to resolve entities in a familial network. Because learning a probabilistic distribution on large KBs is time-consuming, PARIS and SiGMa implement message passing-style algorithms that obtain seed matches created by hand-crafted rules and pass the match messages to their neighbors. However, they do not leverage crowdsourcing to improve the ER accuracy and may encounter the error accumulation problem.
In this section, we present necessary preliminaries to define our problem, followed by a general workflow of our approach. Frequently used notations are summarized in Table I.
KBs store rich, structured real-world facts. In a KB, each fact is stated in a triple of the form $(\mathit{entity}, \mathit{property}, \mathit{value})$, where $\mathit{property}$ can be either an attribute or a relationship, and $\mathit{value}$ can be either a literal or another entity. The sets of entities, literals, attributes, relationships and triples are denoted by $U$, $L$, $A$, $R$ and $T$, respectively. Therefore, a KB is defined as a 5-tuple $K = (U, L, A, R, T)$. Moreover, attribute triples attach entities with literals (e.g., through rdfs:label), and relationship triples link entities by relationships (e.g., through y:wasBornIn).
Entity Resolution (ER) aims to resolve entities in KBs denoting the same real-world thing. Let $u_1, u_2$ denote two entities in two different KBs. We call the entity pair $(u_1, u_2)$ a match and denote it by $u_1 \simeq u_2$ or $m_{(u_1, u_2)}$ if $u_1, u_2$ refer to the same real-world object. In contrast, we call $(u_1, u_2)$ a non-match and denote it by $u_1 \not\simeq u_2$ if $u_1, u_2$ refer to two different objects. Both matches and non-matches are regarded as resolved entity pairs, and other pairs are regarded as unresolved. Traditionally, reference matches (i.e., gold standard) are used to evaluate the quality of the ER results, and precision, recall and F1-score are widely-used metrics.
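To make the evaluation metrics concrete, the following minimal sketch computes precision, recall and F1-score for a set of predicted matches against reference matches. The function name and the toy entity identifiers are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(predicted, reference):
    """Compare predicted matches with reference matches (sets of entity pairs)."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: three predicted pairs, two of which are correct.
reference = {("y:a", "d:a"), ("y:b", "d:b"), ("y:c", "d:c"), ("y:d", "d:d")}
predicted = {("y:a", "d:a"), ("y:b", "d:b"), ("y:c", "d:x")}
p, r, f1 = precision_recall_f1(predicted, reference)
```

Here two of the three predictions are correct and two of the four reference matches are found, so precision is 2/3 and recall is 1/2.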
Crowdsourced ER carries out ER with human help. Usually, it executes several human-machine loops; in each loop, the machine picks one or several questions for workers to label and updates the ER results according to the labels. Due to the monetary cost of human labor, a crowdsourced ER algorithm is expected to ask a limited number of questions while obtaining as many results as possible.
Given two KBs $K_1$ and $K_2$, and a budget, the crowdsourced collective ER problem is to maximize recall with a precision restriction by asking humans to label tasks while not exceeding the budget.
Specifically, we assume that both KBs contain “dense” relationships and focus on using matches obtained from workers to jointly infer matches with relationships.
Given two KBs as input, Figure 2 shows the workflow of our approach to crowdsourced collective ER. After iterating four processing stages, the approach returns a set of matches between the two KBs.
ER graph construction aims to construct a small ER graph by reducing the amount of vertices (i.e. entity pairs). It first conducts a similarity measurement to filter out some non-matches. At the same time, it uses some matches obtained from exact matching [12, 29, 30] to calculate the similarities between attributes and find attribute matches. Then, based on the attribute matches, it assembles the similarities between values to similarity vectors, and leverages the natural partial order on the vectors to prune more vertices.
Relational match propagation models how to use matches to infer the match probabilities of unresolved entity pairs in each connected component of the ER graph. It first uses some matches and maximum likelihood estimation to measure the consistency of relationships. Then, based on the consistency of relationships and the ER graph structure, it computes the conditional match probabilities of unresolved entity pairs given the matches. The conditional match probabilities derive a probabilistic ER graph.
Multiple questions selection selects a set of unresolved entity pairs in the probabilistic ER graph as questions to ask workers. It models the discovery of inferred match set for each question as the all-pairs shortest path problem and uses a graph-based algorithm to solve it. We prove that the multiple questions selection problem is NP-hard and design a greedy algorithm to find the best questions to ask.
Truth inference infers matches based on the results labeled by workers. It first computes the posterior match probabilities of the questions based on the quality of the workers, and then leverages these posterior probabilities to update the (probabilistic) ER graph. Also, for isolated entity pairs, it builds a random forest classifier to avoid asking the workers to check them one by one.
The approach stops asking more questions when there is no unresolved entity pair that can be inferred by relational match propagation.
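As an illustration of the truth inference stage, the sketch below computes a posterior match probability for one question from several worker labels. It is not the paper's exact method: it assumes that workers answer independently and that each worker is characterized by a single accuracy value (the probability of giving the correct label), and the function name is hypothetical.

```python
def posterior_match_probability(prior, labels):
    """Bayesian update of a match probability from worker labels.

    prior  -- prior probability that the entity pair is a match
    labels -- list of (label, accuracy); label is True for "match"
    Assumption of this sketch: independent workers with symmetric accuracy.
    """
    like_match, like_non_match = prior, 1.0 - prior
    for label, accuracy in labels:
        like_match *= accuracy if label else 1.0 - accuracy
        like_non_match *= (1.0 - accuracy) if label else accuracy
    return like_match / (like_match + like_non_match)


# Two fairly reliable workers say "match", one weaker worker disagrees.
post = posterior_match_probability(0.5, [(True, 0.9), (True, 0.8), (False, 0.6)])
```

With these toy numbers the match hypothesis has weight 0.5 · 0.9 · 0.8 · 0.4 = 0.144 and the non-match hypothesis 0.5 · 0.1 · 0.2 · 0.6 = 0.006, so the posterior is 0.96.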
Graph structures [31, 32] are widely used to model the resolution states of entity pairs and the relationships between them. For example, Dong et al. proposed a dependency graph to model the dependency between similarities of entity pairs. In this paper, we use the notion of ER graph to denote this graph structure. Different from the dependency graph, each edge in the ER graph is labeled with a pair of relationships from two KBs.
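Given two KBs $K_1 = (U_1, L_1, A_1, R_1, T_1)$ and $K_2 = (U_2, L_2, A_2, R_2, T_2)$, an ER graph on $K_1$ and $K_2$ is a directed, edge-labeled multigraph $G = (V, E, l)$, such that (1) $V \subseteq U_1 \times U_2$; (2) for each vertex pair $(u_1, u_2), (u_1', u_2') \in V$, there is an edge from $(u_1, u_2)$ to $(u_1', u_2')$ labeled with $(r_1, r_2)$ if and only if $(u_1, r_1, u_1') \in T_1$ and $(u_2, r_2, u_2') \in T_2$.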
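The definition can be read as a simple construction procedure. The sketch below builds the labeled edges of an ER graph from two lists of relationship triples and a set of candidate entity pairs; the function name, the data layout, and the toy identifiers (including d:birthPlace) are assumptions made for illustration.

```python
from collections import defaultdict


def build_er_graph(vertices, triples1, triples2):
    """Return {(source_pair, target_pair): set of (r1, r2) edge labels}."""
    out1 = defaultdict(list)  # entity -> [(relationship, neighbor)] in KB 1
    out2 = defaultdict(list)  # entity -> [(relationship, neighbor)] in KB 2
    for s, r, o in triples1:
        out1[s].append((r, o))
    for s, r, o in triples2:
        out2[s].append((r, o))
    vertex_set = set(vertices)
    edges = defaultdict(set)
    for (u1, u2) in vertex_set:
        for r1, n1 in out1[u1]:
            for r2, n2 in out2[u2]:
                # An edge exists only if the neighbor pair is itself a vertex.
                if (n1, n2) in vertex_set:
                    edges[((u1, u2), (n1, n2))].add((r1, r2))
    return dict(edges)


# Hypothetical toy KBs with one relationship triple each.
triples1 = [("y:Person", "y:wasBornIn", "y:City")]
triples2 = [("d:Person", "d:birthPlace", "d:City")]
vertices = [("y:Person", "d:Person"), ("y:City", "d:City")]
edges = build_er_graph(vertices, triples1, triples2)
```

The result contains a single edge from the person pair to the city pair, labeled with the relationship pair (y:wasBornIn, d:birthPlace).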
Figure 1 illustrates an ER graph fragment built from DBpedia and YAGO. Note that an entity can occur in multiple vertices, and a relationship can appear in different edge labels. A probabilistic ER graph is an ER graph where each edge $(v, v')$ is labeled with a conditional probability $\Pr[m_{v'} \mid m_v]$, i.e., the probability that $v'$ is a match given that $v$ is a match. The major challenge of constructing an ER graph is how to significantly reduce the size of the graph while preserving as many potential entity matches as possible.
We conduct string matching on entity labels (e.g., the values of rdfs:label) to generate candidate entity matches and regard them as vertices in the ER graph. Specifically, we first normalize entity labels via lowercasing, tokenization, stemming, etc. Then, we leverage the Jaccard coefficient (the size of the intersection divided by the size of the union of two sets) as our similarity measure to compute similarities on the normalized label token sets, and follow the previous studies [5, 16, 19] to prune the entity pairs whose similarities are less than a predefined threshold (e.g., 0.3). Although the choice of thresholds is dataset dependent, this process runs fast and largely reduces the number of non-matches, thus helping the ER approaches scale up. Note that there are many choices of similarity metric, e.g., edit distance; our approach can work with any of them and we use Jaccard for illustration purposes only. The set of candidate entity matches is denoted by $M_c$. Similar to prior work, we use the label similarities as prior match probabilities, i.e., $\Pr[u_1 \simeq u_2]$ is set to the label similarity of $(u_1, u_2)$. More accurate estimation in [6, 7] can be achieved by human labeling.
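The candidate generation step can be sketched as follows. This is a minimal illustration, assuming a very simple normalization (lowercasing and alphanumeric tokenization, without stemming) and a quadratic scan over all label pairs; the function names and the toy labels are hypothetical.

```python
import re


def tokens(label):
    """Lowercase a label and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(labels1, labels2, threshold=0.3):
    """Keep entity pairs whose label similarity reaches the threshold."""
    kept = {}
    for u1, l1 in labels1.items():
        for u2, l2 in labels2.items():
            sim = jaccard(tokens(l1), tokens(l2))
            if sim >= threshold:
                kept[(u1, u2)] = sim  # also used as prior match probability
    return kept


labels1 = {"y:1": "Blue River Story", "y:2": "Anna Example"}
labels2 = {"d:1": "The Blue River Story (film)", "d:2": "Anna Example"}
candidates = candidate_pairs(labels1, labels2)
```

In this toy input the first pair shares three of five distinct tokens (similarity 0.6), the second pair has identical labels (similarity 1.0), and the two cross pairs share no token and are pruned.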
In $M_c$, we refer to the subset of entity pairs that have exactly the same labels as initial entity matches. We leverage them as a priori knowledge for attribute and relationship matching (see Sections IV-C and V-A). Other features, e.g., owl:sameAs and inverse functional properties, may also be used to infer initial entity matches [34, 30]. Note that we do not directly add initial entity matches to the final ER results, because they may contain errors. The set of initial entity matches is denoted by $M_{in}$.
For such a set of initial entity matches $M_{in}$ between two KBs $K_1$ and $K_2$, we proceed to define the following attribute similarity to find their attribute matches. For any two attributes $a_1 \in A_1$ and $a_2 \in A_2$, their similarity is defined as the average similarity of their values:
$$\mathrm{sim}_A(a_1, a_2) = \frac{1}{|M_{in}|} \sum_{(u_1, u_2) \in M_{in}} \mathrm{sim}_L\big(N_{u_1}^{a_1}, N_{u_2}^{a_2}\big),$$
where $N_{u_1}^{a_1} = \{\, l \mid (u_1, a_1, l) \in T_1 \,\}$ and $N_{u_2}^{a_2}$ is defined analogously. $\mathrm{sim}_L$ represents an extended Jaccard similarity measure for two sets of literals, which employs an internal literal similarity measure and a threshold to determine two literals being the same when their similarity is not lower than the threshold. For different types of literals, we use the Jaccard coefficient for strings and the maximum percentage difference for numbers (e.g., integers, floats and dates). The threshold is set to 0.9 to guarantee high precision. We refer interested readers to the literature for more information about attribute matching.
For simplicity, every attribute in one KB is restricted to match at most one attribute in the other KB. This global 1:1 matching constraint is widely used in ontology matching, and facilitates our assembling of similarity vectors (later in Section IV-D). The 1:1 attribute matching selection is modeled as the bipartite graph matching problem and solved with the Hungarian algorithm in polynomial time. The set of attribute matches is denoted by $M_{at}$.
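Given the candidate entity match set $M_c$ and the attribute match set $M_{at}$, for each candidate $(u_1, u_2) \in M_c$, we create a similarity vector $s(u_1, u_2) = (s_1, s_2, \ldots, s_{|M_{at}|})$, where $s_i$ is the literal similarity ($\mathrm{sim}_L$) between $u_1$ and $u_2$ on the $i$-th attribute match ($1 \le i \le |M_{at}|$). As a consequence, a natural partial order exists among the similarity vectors: $s(u_1, u_2) \succeq s(u_1', u_2')$ if and only if $s_i \ge s_i'$ for all $1 \le i \le |M_{at}|$. This partial order can be used to determine whether an entity pair is a (non-)match in two ways: (i) an entity pair $(u_1, u_2)$ is a match if there exists an entity pair $(u_1', u_2')$ such that $(u_1', u_2')$ is a match and $s(u_1, u_2) \succeq s(u_1', u_2')$; and (ii) $(u_1, u_2)$ is a non-match if there exists $(u_1', u_2')$ such that $(u_1', u_2')$ is a non-match and $s(u_1', u_2') \succeq s(u_1, u_2)$.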
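The two inference rules induced by the partial order can be sketched in a few lines. The function names and the toy vectors below are hypothetical; the sketch assumes all similarity vectors have the same length.

```python
def dominates(s, t):
    """True if vector s is at least as large as t in every component."""
    return all(a >= b for a, b in zip(s, t))


def infer_label(vector, known_matches, known_non_matches):
    """Return True/False when the partial order decides the pair, else None."""
    if any(dominates(vector, m) for m in known_matches):
        return True   # rule (i): at least as similar as a known match
    if any(dominates(n, vector) for n in known_non_matches):
        return False  # rule (ii): at most as similar as a known non-match
    return None       # incomparable with all resolved pairs


known_matches = [(0.8, 0.7)]
known_non_matches = [(0.3, 0.4)]
decided_match = infer_label((0.9, 0.7), known_matches, known_non_matches)
decided_non_match = infer_label((0.2, 0.4), known_matches, known_non_matches)
undecided = infer_label((0.9, 0.1), known_matches, known_non_matches)
```

The third vector is larger than the known match on one attribute and smaller on the other, so it is incomparable and remains unresolved.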
We incorporate this partial order into a $k$-nearest neighbor search for further pruning the candidate entity match set $M_c$. Let us assume that an entity $u_1$ in one KB has a set of candidate match counterparts in another KB. The similarity vectors are written as $s(u_1, u_2^{(1)}), s(u_1, u_2^{(2)}), \ldots$, and we want to determine the top-$k$ in them. Since the partial order is a weak ordering, we count the number of vectors strictly larger than each pair as its “rank”, i.e., the minimal rank in all possible refined full orders. Note that the counterparts of entities in one entity pair are both considered. So, the worst rank of an entity pair $(u_1, u_2)$, denoted by $\mathit{min\_rank}(u_1, u_2)$, is
$$\mathit{min\_rank}(u_1, u_2) = \max\Big( \big|\{ u_2' : s(u_1, u_2') \succ s(u_1, u_2) \}\big|,\ \big|\{ u_1' : s(u_1', u_2) \succ s(u_1, u_2) \}\big| \Big),$$
where all $(u_1, u_2'), (u_1', u_2) \in M_c$.
By $\mathit{min\_rank}$, we design a modified $k$-nearest neighbor algorithm on this partial order (see Algorithm 1). Because the full order among candidate entity matches is unknown, instead of finding the top-$k$ matches directly, we prune the ones that cannot be in the top-$k$. Thus, each entity pair $(u_1, u_2)$ that has at least $k$ strictly larger pairs, i.e., $\mathit{min\_rank}(u_1, u_2) \ge k$, needs to be pruned. Also, each pair smaller than a pruned pair should be removed based on the partial order to avoid redundant checking, because the $\mathit{min\_rank}$ of these pairs must also be at least $k$. The set of retained entity matches is denoted by $M_{rd}$, where each entity is involved in nearly $k$ candidate matches, due to the weak ordering of the partial order.
Algorithm 1 first partitions the entity match set $M_c$ into blocks in which all pairs contain the same entity (Line 8). Then, it checks each entity pair in a block, and prunes the entity pairs whose worst rank shows that they cannot be in the top-$k$ (Lines 10–12). Finally, the retained pairs are added into the output match set.
Algorithm 1 first takes time to pre-compute the similarity vectors. When processing , the pruning step (Lines 7–13) checks at most pairs, and each time it spends time to compute , prune pairs in and store the retained pairs in . So, the overall time complexity of Algorithm 1 is . In practice, similarity vector construction is the most time-consuming part, while the pruning step only needs to check a small amount of entities in or .
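The pruning idea behind Algorithm 1 can be sketched as below. This is a simplified illustration rather than the algorithm itself: it assumes that the rank is counted from zero (a pair is kept when fewer than k pairs sharing one of its entities are strictly larger), it recomputes ranks by brute force within each block, and it omits the shortcut that removes pairs dominated by an already pruned pair. All names are hypothetical.

```python
from collections import defaultdict


def strictly_larger(s, t):
    """s dominates t component-wise and differs in at least one component."""
    return all(a >= b for a, b in zip(s, t)) and tuple(s) != tuple(t)


def prune_by_rank(vectors, k):
    """vectors: {(u1, u2): similarity vector}. Keep pairs that may be top-k."""
    by_left, by_right = defaultdict(list), defaultdict(list)
    for (u1, u2) in vectors:
        by_left[u1].append((u1, u2))   # block of pairs sharing u1
        by_right[u2].append((u1, u2))  # block of pairs sharing u2

    def worst_rank(pair):
        u1, u2 = pair
        s = vectors[pair]
        left = sum(strictly_larger(vectors[p], s) for p in by_left[u1])
        right = sum(strictly_larger(vectors[p], s) for p in by_right[u2])
        return max(left, right)

    return {pair for pair in vectors if worst_rank(pair) < k}


vectors = {
    ("a", "x"): (0.9, 0.9),
    ("a", "y"): (0.5, 0.6),
    ("a", "z"): (0.4, 0.7),
    ("b", "y"): (0.8, 0.8),
}
retained_k1 = prune_by_rank(vectors, k=1)
retained_k2 = prune_by_rank(vectors, k=2)
```

With k = 1, the pairs ("a", "y") and ("a", "z") are pruned because ("a", "x") is strictly larger than both; with k = 2, every pair has at most one strictly larger competitor and all four are retained.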
Given an ER graph $G$ and an entity match $(u_1, u_2)$ in it, the relational match propagation infers how likely each unresolved entity pair $(u_1', u_2')$ is a match based on the structure of $G$, i.e., $\Pr[u_1' \simeq u_2' \mid u_1 \simeq u_2]$. In this section, we first consider a basic case in which unresolved entity pairs are neighbors of a match in $G$. Then, we generalize it to the case in which unresolved pairs are reachable from several matches. In the basic case, we resolve entity pairs between two value sets of a relationship pair, and define the consistency between relationships to measure the portion of values containing matched counterparts in another value set. The consistency and the prior match probabilities of entity pairs are further combined to obtain “tight” posterior match probabilities. In the general case, we propose a Markov model on paths from matches to unresolved ones to find the match probability bounds.
Functional/inverse functional properties are ideal for match propagation. For example, wasBornIn is a functional property, and the birth places of two persons in a match must be identical. However, we cannot rely on functional/inverse functional properties alone, since many relationships are multi-valued and only a part of the values may match. Thus, we define the consistency between relationships as follows.
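Let $r_1$ and $r_2$ be two relationships in two KBs $K_1, K_2$, respectively. We assume that, given the condition that $u_1 \simeq u_2$ and $u_1' \in N_{u_1}^{r_1}$, the probability of the event that some $u_2' \in N_{u_2}^{r_2}$ satisfies $u_1' \simeq u_2'$ is subject to a binary (Bernoulli) distribution with parameter $\epsilon_1$. Symmetrically, we define parameter $\epsilon_2$. We use $\epsilon_1$ and $\epsilon_2$ to represent the consistency for two relationships $r_1$ and $r_2$, respectively:
$$\epsilon_1 = \Pr\big[\exists u_2' \in N_{u_2}^{r_2} : u_1' \simeq u_2' \,\big|\, u_1 \simeq u_2,\ u_1' \in N_{u_1}^{r_1}\big], \qquad \epsilon_2 = \Pr\big[\exists u_1' \in N_{u_1}^{r_1} : u_1' \simeq u_2' \,\big|\, u_1 \simeq u_2,\ u_2' \in N_{u_2}^{r_2}\big],$$
where $N_{u_1}^{r_1}, N_{u_2}^{r_2}$ are the value sets of relationships $r_1, r_2$ w.r.t. entities $u_1, u_2$, respectively.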
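To estimate $\epsilon_1$ and $\epsilon_2$, we use the value distribution on the initial entity matches $M_{in}$. For an entity pair $(u_1, u_2) \in M_{in}$, we introduce a latent random variable $x_{u_1, u_2} = |M_{u_1, u_2}|$, where $M_{u_1, u_2}$ denotes the set of entity matches in $N_{u_1}^{r_1} \times N_{u_2}^{r_2}$. Note that we omit $r_1, r_2$ in $M_{u_1, u_2}$ and $x_{u_1, u_2}$ to simplify notations. Similar to prior work, we make an assumption on the entity sets: no duplicate entities exist in each entity set. Hence, $x_{u_1, u_2}$ is also the number of entities in $N_{u_1}^{r_1}$ (or $N_{u_2}^{r_2}$) which appear in $M_{u_1, u_2}$. Based on the latent variable $x_{u_1, u_2}$, the likelihood probability of $(u_1, u_2)$ is
$$\mathcal{L}(u_1, u_2) = \epsilon_1^{x_{u_1, u_2}} (1 - \epsilon_1)^{|N_{u_1}^{r_1}| - x_{u_1, u_2}}\ \epsilon_2^{x_{u_1, u_2}} (1 - \epsilon_2)^{|N_{u_2}^{r_2}| - x_{u_1, u_2}}. \tag{4}$$
Then, we use the maximum likelihood estimation to obtain $\epsilon_1$ and $\epsilon_2$:
$$(\epsilon_1, \epsilon_2) = \arg\max_{\epsilon_1, \epsilon_2}\ \max_{\{x_{u_1, u_2}\}} \prod_{(u_1, u_2) \in M_{in}} \mathcal{L}(u_1, u_2). \tag{5}$$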
Since each is an integer variable, the brute-force optimization can cost exponential time. Next, we present an optimization process. Let and , where . We simplify (5) to , where . Notice that has only one solution for different integers . Thus, the curves () can have at most common points, where . Therefore, is an -piecewise continuous function, and the product of these -piecewise continuous functions is an -piecewise continuous function. As a result, we can optimize (5) by solving continuous optimization problems with two variables, which runs efficiently.
A basic case is that the unresolved entity pairs are adjacent to a match $(u_1, u_2)$ in $G$. We consider the neighbors with the same edge label, i.e., relationship pair $(r_1, r_2)$, together. Then, our goal is to identify matches between $N_{u_1}^{r_1}$ and $N_{u_2}^{r_2}$.
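Let $M \subseteq N_{u_1}^{r_1} \times N_{u_2}^{r_2}$ denote a set of entity matches. We consider two factors about how likely $M$ can be the correct match result of $N_{u_1}^{r_1} \times N_{u_2}^{r_2}$: (1) the prior match probabilities of matches without neighborhood information; (2) the consistency of the relationships. The match probability of $M$ given $u_1 \simeq u_2$ is:
$$\Pr[M \mid u_1 \simeq u_2] = \frac{1}{Z}\, \Pr[M]\, f_1(M)\, f_2(M),$$
where $Z$ is the normalization factor, $\Pr[M]$ is the prior match probability, and $f_1(M), f_2(M)$ are the consistency of $M$ w.r.t. $r_1, r_2$, respectively.
Without considering neighborhood information, the prior match probability is defined as the likelihood function of $M$:
$$\Pr[M] = \prod_{(u_1', u_2') \in M} \Pr[u_1' \simeq u_2'] \prod_{(u_1', u_2') \in (N_{u_1}^{r_1} \times N_{u_2}^{r_2}) \setminus M} \Pr[u_1' \not\simeq u_2'],$$
where $\Pr[u_1' \simeq u_2']$ denotes the prior probability of entity pair $(u_1', u_2')$ being a match, and $\Pr[u_1' \not\simeq u_2'] = 1 - \Pr[u_1' \simeq u_2']$ denotes the prior probability of $(u_1', u_2')$ being a non-match.
Let $M_1 = \{\, u_1' \mid (u_1', u_2') \in M \,\}$. Note that when $u_1$ and $u_2$ form a match, each entity $u_1' \in M_1$ is a neighbor of $u_1$ for relationship $r_1$ such that $u_1'$ has a matched counterpart in $N_{u_2}^{r_2}$. Based on $\epsilon_1$, the consistency of $M$ given $r_1$ is defined as follows:
$$f_1(M) = \epsilon_1^{|M_1|}\, (1 - \epsilon_1)^{|N_{u_1}^{r_1}| - |M_1|}.$$
$M_2$ and $f_2(M)$ can be defined similarly.
Finally, we obtain the posterior match probability of $(u_1', u_2')$ by marginalizing $\Pr[M \mid u_1 \simeq u_2]$:
$$\Pr[u_1' \simeq u_2' \mid u_1 \simeq u_2] = \sum_{M \ni (u_1', u_2')} \Pr[M \mid u_1 \simeq u_2],$$
where $M$ is selected over the subsets of $N_{u_1}^{r_1} \times N_{u_2}^{r_2}$.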
Example. Let , and denote the relationship directed, , and (implying all pairs are viewed as the same). From Figure 1, we can find that and . Thus, when , ; when , . So, is more likely to be the match set within . Furthermore, , whereas .
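The match propagation to neighbors can be illustrated by brute-force enumeration on tiny value sets. The sketch below is only an approximation of the model described above, under stated assumptions: candidate match sets are restricted to partial 1:1 matchings (following the no-duplicate assumption), the prior of a match set is the product of per-pair prior (non-)match probabilities, and the consistency of a match set with k pairs is a product of Bernoulli terms with hypothetical parameters eps1 and eps2. All names and numbers are illustrative.

```python
from itertools import combinations, permutations


def matchings(n1, n2):
    """Yield every partial 1:1 matching between two small value sets."""
    for size in range(min(len(n1), len(n2)) + 1):
        for left in combinations(n1, size):
            for right in permutations(n2, size):
                yield frozenset(zip(left, right))


def posterior(n1, n2, prior, eps1, eps2):
    """Marginal match probability of each neighbor pair, given a parent match."""
    weights = {}
    for m in matchings(n1, n2):
        w = 1.0
        for a in n1:
            for b in n2:
                p = prior[(a, b)]
                w *= p if (a, b) in m else 1.0 - p       # prior factor
        k = len(m)
        w *= eps1 ** k * (1 - eps1) ** (len(n1) - k)     # consistency w.r.t. r1
        w *= eps2 ** k * (1 - eps2) ** (len(n2) - k)     # consistency w.r.t. r2
        weights[m] = w
    z = sum(weights.values())                            # normalization factor
    return {(a, b): sum(w for m, w in weights.items() if (a, b) in m) / z
            for a in n1 for b in n2}


n1, n2 = ["a1", "a2"], ["b1", "b2"]
prior = {("a1", "b1"): 0.6, ("a1", "b2"): 0.2,
         ("a2", "b1"): 0.2, ("a2", "b2"): 0.5}
post = posterior(n1, n2, prior, eps1=0.9, eps2=0.9)
```

With highly consistent relationships (0.9), match sets that cover both value sets dominate, so the posterior of ("a1", "b1") rises well above its prior of 0.6 while that of ("a1", "b2") drops below 0.2. The enumeration is exponential and only meant for value sets of a handful of entities.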
The above match propagation to neighbors only estimates the match probabilities of direct neighbors of an entity match, which lacks the capability of discovering entity matches far away. In the following, we extend it to a more general case, called distant match propagation, where a match reaches an unresolved entity pair through a path.
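Intuitively, given a match $v_0$ and an unresolved pair $v_l$, the distant propagation process can be modeled as a path consisting of the entity pairs from $v_0$ to $v_l$, where each unresolved pair can be inferred as a match via its precedent. Assume that there is a path $(v_0, v_1, \ldots, v_l)$ in $G$, where $v_0$ is the match and $v_1, \ldots, v_l$ are unresolved entity pairs. Writing $m_v$ for the event that the entity pair $v$ is a match, according to the chain rule of conditional probability, we have
$$\Pr[m_{v_l} \mid m_{v_0}] \ \ge\ \Pr[m_{v_1}, \ldots, m_{v_l} \mid m_{v_0}] = \prod_{i=1}^{l} \Pr[m_{v_i} \mid m_{v_0}, \ldots, m_{v_{i-1}}] = \prod_{i=1}^{l} \Pr[m_{v_i} \mid m_{v_{i-1}}], \tag{10}$$
where the last “=” holds because we assume that this propagation path satisfies the Markov property. Inequality (10) gives a lower bound for $\Pr[m_{v_l} \mid m_{v_0}]$. The largest lower bound over all such paths is selected to estimate $\Pr[m_{v_l} \mid m_{v_0}]$. We compute this estimate in Algorithm 2.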
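As a small numeric illustration of the path-based lower bound, the sketch below multiplies the conditional probabilities along each path and keeps the largest product. The function names and the edge probabilities are hypothetical.

```python
from math import prod


def path_lower_bound(edge_probs):
    """Product of conditional match probabilities along one propagation path."""
    return prod(edge_probs)


def best_lower_bound(paths):
    """Largest lower bound over all known paths from the match to the target."""
    return max(path_lower_bound(p) for p in paths)


# Two hypothetical paths from a labeled match to the same unresolved pair.
paths = [[0.9, 0.8], [0.95, 0.5, 0.9]]
bound = best_lower_bound(paths)
```

The first path yields 0.9 · 0.8 = 0.72 and the second 0.95 · 0.5 · 0.9 = 0.4275, so 0.72 is used as the estimate.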
Based on the relational match propagation, unresolved entity pairs can be inferred by human-labeled matches. However, different questions have different inference capabilities. In this section, we first describe the definition of inferred match set and the multiple questions selection problem. Then, we design a graph-based algorithm to determine the inferred match set for each question. Finally, we formulate the benefit of multiple questions and design a greedy algorithm to select the best questions.
We follow the so-called pairwise question interface [5, 6, 7, 12, 14, 17], where each question is whether an entity pair is a match or not. Let $Q$ be a set of pairwise questions. Labeling can be defined as a binary function $H: Q \to \{0, 1\}$, where for each question $q \in Q$, $H(q) = 1$ means that $q$ is labeled as a match, while $H(q) = 0$ indicates that $q$ is labeled as a non-match.
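Given the labels $H$, we propagate the labeled matches in $Q$ to unresolved pairs. The set of entity pairs that can be inferred as matches by $H$ is
$$\mathit{inferred}(H) = \{\, p \in C \mid \exists q \in Q : H(q) = 1 \wedge \Pr[m_p \mid m_q] \ge \tau \,\},$$
where $C$ is the set of unresolved entity pairs, $m_v$ denotes the event that the entity pair $v$ is a match, and $\tau$ is the precision threshold for inferring high-quality matches. We describe how to evaluate $\Pr[m_p \mid m_q]$ in Section VI-B.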
Since there are quadratically more non-matches than matches in the ER problem, the labels to the ideal questions should infer as many matches as possible. Thus, we define the benefit function of $Q$ as the expected number of matches that can be inferred by labels to $Q$, which is
$$\mathit{benefit}(Q) = \mathbb{E}_{H}\big[\, |\mathit{inferred}(H)| \,\big].$$
The ER algorithm can iteratively ask the question with the greatest benefit; however, there is a latency caused by waiting for workers to finish the question. Assigning multiple questions to workers simultaneously in one human-machine loop is a straightforward optimization to reduce the latency. Since workers in crowdsourcing platforms are paid based on the number of solved questions, the number of questions should not exceed a given budget. Thus, the optimal multiple questions selection problem is to
$$\max_{Q}\ \mathit{benefit}(Q) \quad \text{s.t.}\quad |Q| \le \mu,$$
where $\mu$ is the constraint on the number of questions asked.
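A greedy strategy for this budgeted selection can be sketched as follows. The sketch rests on assumptions that are not taken from the text: labels of different questions are treated as independent, a question q is labeled as a match with probability prior[q], and inferred[q] is the precomputed set of pairs that q can infer when labeled as a match. The function names and the toy data are hypothetical.

```python
def expected_benefit(questions, inferred, prior):
    """Expected number of pairs inferred as matches if `questions` are asked."""
    targets = set().union(*(inferred[q] for q in questions))
    total = 0.0
    for p in targets:
        miss = 1.0  # probability that no question able to infer p is a match
        for q in questions:
            if p in inferred[q]:
                miss *= 1.0 - prior[q]
        total += 1.0 - miss
    return total


def greedy_select(candidates, inferred, prior, budget):
    """Pick up to `budget` questions, each time adding the largest marginal gain."""
    chosen = []
    remaining = set(candidates)
    while remaining and len(chosen) < budget:
        base = expected_benefit(chosen, inferred, prior)
        best = max(sorted(remaining),
                   key=lambda q: expected_benefit(chosen + [q], inferred, prior) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


inferred = {"q1": {"p1", "p2", "p3"}, "q2": {"p1", "p2"}, "q3": {"p4", "p5"}}
prior = {"q1": 0.8, "q2": 0.9, "q3": 0.7}
chosen = greedy_select(inferred.keys(), inferred, prior, budget=2)
```

In this toy input, q1 is picked first (expected benefit 2.4). Although q2 has the highest prior, it mostly overlaps with q1 and adds only 0.36, whereas q3 covers a different part of the graph and adds 1.4, so the greedy choice is q1 followed by q3.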
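In order to obtain the benefit $\mathit{benefit}(Q)$ for each question set $Q$, we need to compute, for each $q \in Q$, the set of pairs that $q$ can infer when it is labeled as a match. To estimate $\Pr[m_p \mid m_q]$ in $G$, we define the length of a directed edge $(v, v')$ in the probabilistic ER graph as $-\log \Pr[m_{v'} \mid m_v]$. According to the definition of distant match propagation, $\Pr[m_p \mid m_q] = e^{-\mathit{dist}(q, p)}$, where $\mathit{dist}(q, p)$ is the distance of the shortest path from $q$ to $p$. As a result, the condition $\Pr[m_p \mid m_q] \ge \tau$ can be interpreted as $\mathit{dist}(q, p) \le -\log \tau$. Note that an edge $(v, v')$ can be removed when $\Pr[m_{v'} \mid m_v] = 0$ to avoid $\log 0$.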
The all-pairs shortest path algorithms can efficiently compute $\mathit{dist}(q, p)$ for every $q, p \in V$. Since most $\Pr[m_p \mid m_q]$ should be smaller than $\tau$, we choose to apply binary trees rather than an array of size $|V| \times |V|$ to maintain distances. We depict our modified Floyd–Warshall algorithm in Algorithm 2. In Lines 1–2, for every $q \in V$, we create a binary tree to store the pairs inferred by $q$ as well as their corresponding lengths, and a binary tree to store the pairs inferring $q$ as well as their corresponding lengths. In Lines 3–5, each edge whose length is not greater than $-\log \tau$ is stored into the binary trees. In Lines 6–11, we modify the dynamic programming process in the original Floyd–Warshall algorithm. Since the number of pairs which can be inferred is significantly less than $|V|$, the inner loop in Lines 9–11 iterates only over the set of distances which are likely to be updated. Lines 13–14 extract the inferred match sets from the binary trees.
Since each binary tree contains at most elements, . The loop in Lines 6–11 takes time in total. The time complexity of Algorithm 2 is .
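The idea of a threshold-bounded, sparse Floyd–Warshall computation can be sketched as follows. This is an illustration under assumptions, not a transcription of Algorithm 2: it uses Python dictionaries in place of binary trees, takes edge probabilities as a dictionary keyed by vertex pairs, and adds a tiny tolerance to the distance limit to absorb floating-point rounding. All names are hypothetical.

```python
from math import log


def inferred_sets(vertices, edge_prob, tau):
    """edge_prob: {(v, w): Pr[w is a match | v is a match]}.

    Returns {v: set of pairs w whose estimated Pr[w | v] is at least tau}.
    """
    limit = -log(tau) + 1e-12          # distance bound, with rounding tolerance
    out = {v: {} for v in vertices}    # out[v][w]: shortest known distance v -> w
    inn = {v: {} for v in vertices}    # inn[w][v]: same distance, indexed by target
    for (v, w), p in edge_prob.items():
        if p > 0 and v != w:           # drop zero-probability edges (log 0)
            d = -log(p)
            if d <= limit and d < out[v].get(w, float("inf")):
                out[v][w] = inn[w][v] = d
    for mid in vertices:               # Floyd-Warshall over intermediate vertices
        for src, d1 in list(inn[mid].items()):
            for dst, d2 in list(out[mid].items()):
                if src == dst:
                    continue
                d = d1 + d2
                # Lengths are non-negative, so paths over the limit never help.
                if d <= limit and d < out[src].get(dst, float("inf")):
                    out[src][dst] = inn[dst][src] = d
    return {v: set(out[v]) for v in vertices}


vertices = ["a", "b", "c", "d"]
edge_prob = {("a", "b"): 0.95, ("b", "c"): 0.95, ("c", "d"): 0.5}
result = inferred_sets(vertices, edge_prob, tau=0.9)
```

In this toy graph, "a" infers "b" directly and "c" through "b" (0.95 · 0.95 = 0.9025 ≥ 0.9), while the edge from "c" to "d" is discarded because its probability 0.5 is below the threshold. Only distances within the bound are ever stored, which keeps the per-vertex maps small when few pairs can be inferred.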
Since the match propagation works independently for each label, the event that an entity pair $p$ is inferred as a match by labels $H$ is equivalent to the event that $p$ is inferred by some $q \in Q$ such that $H(q) = 1$. When $p$ is not labeled, $p$ is resolved as a match if and only if at least one question that can resolve $p$ as a match is labeled as a match. Given the question set $Q$, the probability that $p$ can be resolved as a match by labels $H$ is