Automatic construction of large-scale knowledge bases is important to the database and knowledge management communities. Knowledge fusion (KF) (Dong et al., 2014b) is one of the methods used to automatically construct knowledge bases (a.k.a. knowledge harvesting). It collects possibly conflicting values of objects from data sources and applies truth discovery techniques to resolve the conflicts in the collected values. Since the values are extracted from unstructured or semi-structured data, the collected information is error-prone. The goal of the truth discovery used in knowledge fusion is to infer the true value of each object from the noisy observed values retrieved from multiple information sources while simultaneously estimating the reliabilities of the sources. Two potential applications of knowledge fusion are web source trustworthiness estimation and data cleaning (Dong and Srivastava, 2015). By utilizing truth discovery algorithms, we can evaluate the quality of web sources and find systematic errors in data curation by analyzing the identified wrong values.
Truth discovery with hierarchies: As pointed out in (Dong et al., 2014a, b; Li et al., 2016a), the extracted values can be hierarchically structured. In this case, there may be multiple correct values in the hierarchy for an object, even for functional predicates, and we can utilize them to find the most specific correct value among the candidate values. For example, consider the three claimed values ‘NY’, ‘Liberty Island’ and ‘LA’ about the location of the Statue of Liberty in Table 1. Because Liberty Island is an island in NY, ‘NY’ and ‘Liberty Island’ do not conflict with each other. Thus, we can conclude that the Statue of Liberty stands on Liberty Island in NY.
We also observed that many sources provide generalized values in real life. Figure 1 shows the graph of the generalized accuracy against the accuracy of the sources in the real-life datasets BirthPlaces and Heritages used for experiments in Section 5. The accuracy and the generalized accuracy of a source are the proportions of the exactly correct values and the hierarchically correct values among all claimed values, respectively. If a source claims exactly correct values without generalization, it lies on the dotted diagonal line in the graph. This graph shows that many sources in real-life datasets claim generalized values, and each source has its own tendency of generalization when claiming values.
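The two measures can be sketched as follows; the helper names and the toy hierarchy below are illustrative, not taken from the BirthPlaces or Heritages datasets:

```python
def accuracies(claims, truths, ancestors):
    """Return (accuracy, generalized accuracy) of one source.

    accuracy: fraction of claims exactly equal to the truth.
    generalized accuracy: fraction of claims that are exactly correct
    or an ancestor (generalization) of the truth in the hierarchy.
    """
    n = len(claims)
    exact = sum(1 for o, v in claims.items() if v == truths[o])
    hier = sum(1 for o, v in claims.items()
               if v == truths[o] or v in ancestors.get(truths[o], set()))
    return exact / n, hier / n

# Toy data: the source generalizes one value ('NY' instead of
# 'Liberty Island') and gets the other object exactly right.
truths = {'Statue of Liberty': 'Liberty Island', 'Big Ben': 'London'}
ancestors = {'Liberty Island': {'NY', 'USA'}, 'London': {'England', 'UK'}}
claims = {'Statue of Liberty': 'NY', 'Big Ben': 'London'}
acc, gen_acc = accuracies(claims, truths, ancestors)  # 0.5, 1.0
```

A source that only generalizes (never errs) thus has accuracy below 1 but generalized accuracy 1, which places it above the diagonal in Figure 1.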
Most of the existing methods (Zhao et al., 2012; Pasternack and Roth, 2013; Zheng et al., 2016; Dong et al., 2009, 2012) simply regard the generalized values of a correct value as incorrect. This causes problems in estimating the reliabilities of sources. According to (Dong et al., 2014b), 35% of the false negatives in the data fusion task are produced by ignoring such hierarchical structures. Note that there are many publicly available hierarchies such as WordNet (University, 2010) and DBpedia (Auer et al., 2007). Thus, a truth discovery algorithm that incorporates hierarchies was proposed in (Beretta et al., 2016). However, it does not consider the different tendencies of generalization, which may degrade its accuracy. Another drawback is that it needs a threshold to control the granularity of the estimated truth.
|Object|Source|Claimed value|
|Statue of Liberty|UNESCO|NY|
|Statue of Liberty|Wikipedia|Liberty Island|
|Statue of Liberty|Arrangy|LA|
We propose a novel probabilistic model to capture the different generalization tendencies shown in Figure 1. Existing probabilistic models (Pasternack and Roth, 2013; Zheng et al., 2016; Dong et al., 2009, 2012) basically assume two interpretations of a claimed value (i.e., correct and incorrect). By introducing three interpretations of a claimed value (i.e., exactly correct, hierarchically correct, and incorrect), our proposed model represents the generalization tendency and reliability of the sources.
Crowdsourced truth discovery: It is reported in (Dong et al., 2014b) that up to 96% of the false claims are caused by extraction errors rather than by the sources themselves. Since crowdsourcing is an efficient way to utilize human intelligence at low cost, it has been successfully applied in various areas of data integration such as schema matching (Fan et al., 2014), entity resolution (Wang et al., 2012), graph alignment (Kim et al., 2017a) and truth discovery (Zheng et al., 2015, 2016). Thus, we utilize crowdsourcing to improve the accuracy of truth discovery.
It is essential in practice to minimize the cost of crowdsourcing by assigning proper tasks to workers. A popular approach for selecting queries in active learning is uncertainty sampling (Lewis and Gale, 1994; Boim et al., 2012; Kim et al., 2017b; Zheng et al., 2016). It asks the query that reduces the uncertainty of the confidences on the candidate values the most. However, it considers only the uncertainty, regardless of the actual accuracy improvement. The QASCA algorithm (Zheng et al., 2015) asks the query with the highest estimated accuracy improvement, but measures the improvement without considering the number of collected claimed values. This can be inaccurate since an additional answer may be less informative for an object which already has many records and answers.
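Uncertainty sampling, as used by these methods, can be sketched as follows (a simplified entropy-based illustration, not the QASCA algorithm itself):

```python
import math

def entropy(dist):
    """Shannon entropy of a confidence distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def most_uncertain(confidences):
    """Pick the object whose confidence distribution has maximum entropy."""
    return max(confidences, key=lambda o: entropy(confidences[o]))

confs = {'o1': [0.9, 0.1],   # nearly decided
         'o2': [0.5, 0.5]}   # maximally uncertain
```

Here `most_uncertain(confs)` selects `'o2'` regardless of how many claims each object already has, which is exactly the limitation discussed above.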
Assume that there are two candidate values of an object with equal confidences. If only a few sources provide claimed values for the object, an additional answer from a crowd worker will significantly change the confidence distribution. Meanwhile, if hundreds of sources already provide claimed values for the object, the influence of an additional answer is likely to be very small. Thus, we need to consider the number of collected answers as well as the current confidence distribution. Based on this observation, we develop a new method to estimate the increase of accuracy more precisely by considering the number of collected records and answers. We also present an incremental EM algorithm to quickly measure the accuracy improvement and propose a pruning technique to efficiently assign the tasks to workers.
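This observation can be illustrated with a simple pseudo-count sketch (a hypothetical helper, not the model developed later):

```python
def add_one_vote(conf, n_claims, answer_idx):
    """Interpret the current confidences as n_claims votes, add one
    vote for the candidate at answer_idx, and renormalize."""
    counts = [c * n_claims for c in conf]
    counts[answer_idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

few = add_one_vote([0.5, 0.5], 4, 0)     # few existing claims
many = add_one_vote([0.5, 0.5], 400, 0)  # many existing claims
# few[0] = 0.6 (large shift), many[0] ≈ 0.501 (tiny shift)
```

The same answer moves the confidence by 0.1 when only four claims exist, but by about 0.001 when four hundred claims exist.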
An overview of our truth discovery algorithm: By combining the proposed task assignment and truth inference algorithms, we develop a novel crowdsourced truth discovery algorithm using hierarchies. As illustrated in Figure 2, our algorithm consists of two components: hierarchical truth inference and task assignment. The hierarchical truth inference algorithm finds the correct values from the conflicting values, which are collected from different sources and crowd workers, using hierarchies. The task assignment algorithm distributes objects to the workers who are likely to increase the accuracy of the truth discovery the most. The proposed algorithm alternates between truth inference and task assignment until the crowdsourcing budget runs out. As discussed in (Li et al., 2016b), some workers answer more slowly than others and increase the latency. However, we do not investigate how to reduce the latency in this work since we can utilize the techniques proposed in (Haas et al., 2015).
Our contributions: The contributions of this paper are summarized below.
We propose a truth inference algorithm utilizing the hierarchical structures in claimed values. To the best of our knowledge, it is the first work which considers both the reliabilities and the generalization tendencies of the sources.
To assign a task which will most improve the accuracy, we develop an incremental EM algorithm to estimate the accuracy improvement for a task by considering the number of claimed values as well as the confidence distribution. We also devise an efficient task assignment algorithm for multiple crowd workers based on the quality measure.
We empirically show that the proposed algorithm outperforms the existing works with extensive experiments on real-life datasets.
In this section, we provide the definitions and the problem formulation of crowdsourced truth discovery in the presence of hierarchy.
For ease of presentation, we assume that we are interested in a single attribute of the objects, although our algorithms can be easily generalized to find the truths of multiple attributes. Thus, we use ‘the target attribute value of an object’ and ‘the value of an object’ interchangeably.
A source is a structured or unstructured database which contains the information on target attribute values for a set of objects. In this paper, a source is a certain web page or website and a worker represents a human worker in crowdsourcing platforms. The information of an object provided by a source or a worker is called a claimed value.
Definition 2.1 ().
A record is a piece of data describing the information about an object provided by a source. A record on an object o from a source s is represented as a triple (o, s, v_o^s) where v_o^s is the claimed value of the object o collected from s. Similarly, if a worker w answers that the truth on an object o is v, the answer is represented as a triple (o, w, v).
Let S_o be the set of sources which claimed a value on the object o and V_o be the set of candidate values collected from S_o. Each worker answers a question about the object o by selecting a value from V_o.
In our problem setting, we assume that we have a hierarchy tree H of the claimed values. If we are interested in an attribute related to locations (e.g., birthplace), H would be a geographical hierarchy with different levels of granularity (e.g., continent, country, city, etc.). We also assume that there is no answer with the value of the root of the hierarchy since it provides no information at all (e.g., Earth as a birthplace). We summarize the notations to be used in the paper in Table 2.
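A minimal hierarchy structure supporting the ancestor lookup and the conflict test used throughout might look like this; the class and the child-to-parent map are an illustrative sketch, not the paper's implementation:

```python
class Hierarchy:
    """Hierarchy tree stored as a child -> parent map (root maps to None)."""

    def __init__(self, parent):
        self.parent = parent

    def ancestors(self, v):
        """All proper ancestors of v, nearest first, up to the root."""
        out, p = [], self.parent.get(v)
        while p is not None:
            out.append(p)
            p = self.parent.get(p)
        return out

    def compatible(self, u, v):
        """True if u and v do not conflict: equal, or one generalizes the other."""
        return u == v or u in self.ancestors(v) or v in self.ancestors(u)

H = Hierarchy({'Liberty Island': 'NY', 'LA': 'USA', 'NY': 'USA', 'USA': None})
```

For the running example, `H.compatible('NY', 'Liberty Island')` holds while `H.compatible('LA', 'Liberty Island')` does not, mirroring the Statue of Liberty discussion above.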
Example 2.2 ().
Consider the records in Table 1. Since the source Wikipedia claims that the location of the Statue of Liberty is Liberty Island, the record is represented by (o, s, ‘Liberty Island’) where o = ‘Statue of Liberty’ and s = ‘Wikipedia’. If a human worker ‘Emma Stone’ answered that Big Ben is in London, the answer is represented by (o, w, ‘London’) where o = ‘Big Ben’ and w = ‘Emma Stone’.
2.2. Problem Definition
Given a set of objects O and a hierarchy tree H, we define the two subproblems of crowdsourced truth discovery.
Definition 2.3 (Hierarchical truth inference problem).
For a set of records R collected from the sources and a set of answers A from the workers, we find the most specific true value of each object o among the candidate values in V_o by using the hierarchy H.
Definition 2.4 (Task assignment problem).
For each worker in a set of workers W, we select the top-k objects from O which are likely to increase the overall accuracy of the inferred truths the most by using the hierarchy H.
|s|A data source|
|w|A crowd worker|
|v_o^s|Claimed value from s about o|
|v_o^w|Claimed value from w about o|
|R|Set of all records collected from the set of sources S|
|A|Set of all answers collected from the set of workers W|
|V_o|Set of candidate values about o|
|S_o|Set of sources which post information about o|
|W_o|Set of workers who answered about o|
|O_s|Set of objects that source s provided a value for|
|O_w|Set of objects that worker w answered to|
|Anc(v)|Set of values in V_o which are ancestors of a value v, except the root, in the hierarchy|
|Desc(v)|Set of values in V_o which are descendants of v|
3. Hierarchical Truth Inference
For the hierarchical truth inference, we first model the trustworthiness of sources and workers for a given hierarchy. Then, we propose a probabilistic model to describe the process of generating the set of records and the set of answers based on the trustworthiness modeling. We next develop an inference algorithm to estimate the model parameters and determine the truths.
3.1. Our Generative Model
Our probabilistic graphical model in Figure 3
expresses the conditional dependence (represented by edges) between random variables (represented by nodes). While the previous works (Demartini et al., 2012; Whitehill et al., 2009; Karger et al., 2011; Raykar et al., 2010) assume that all sources and workers have only their own reliabilities, we assume that each source or worker has a generalization tendency as well as a reliability. We first describe how sources and workers generate the claimed values based on their trustworthiness. We next present the model for generating the true value. Finally, we provide the detailed generative process of our probabilistic model.
Model for source trustworthiness: For an object o, let t_o be the truth and v_o^s be the claimed value reported by a source s. Recall that V_o is the set of candidate values for an object o. Furthermore, we let Anc(v) denote the set of candidate values which are ancestors of a value v, except for the root, in the hierarchy H.
There are three relationships between a claimed value v_o^s and the truth t_o: (1) v_o^s = t_o, (2) v_o^s ∈ Anc(t_o) and (3) otherwise. Let φ_s = (φ_{s,1}, φ_{s,2}, φ_{s,3}) be the trustworthiness distribution of a source s where φ_{s,i} is the probability that a claimed value of the source s corresponds to the i-th relationship. In each relationship, a claimed value is generated as follows:
Case 1 (v_o^s = t_o): The source provides the exact true value with a probability φ_{s,1}.
Case 2 (v_o^s ∈ Anc(t_o)): The source provides a generalized true value with a probability φ_{s,2}. In this case, the claimed value is an ancestor of the truth in V_o. We assume that the claimed value is uniformly selected from Anc(t_o).
Case 3 (otherwise): The source provides a wrong value, which is neither t_o nor in Anc(t_o), with a probability φ_{s,3}. The claimed value is uniformly selected among the rest of the candidate values in V_o.
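Cases 1–3 can be sketched as a sampling routine (assuming the object has ancestor candidates; the function and argument names are illustrative):

```python
import random

def sample_claim(truth, candidates, anc_of_truth, phi, rng=random):
    """Sample a claimed value under the three-case source model.
    phi = (p_exact, p_generalize, p_wrong) must sum to 1."""
    r = rng.random()
    if r < phi[0]:
        return truth                              # Case 1: exact true value
    if r < phi[0] + phi[1]:
        return rng.choice(sorted(anc_of_truth))   # Case 2: uniform generalization
    wrong = sorted(v for v in candidates
                   if v != truth and v not in anc_of_truth)
    return rng.choice(wrong)                      # Case 3: uniform wrong value

cands = {'Liberty Island', 'NY', 'LA'}
```

Setting phi to a degenerate distribution forces each case, e.g. `phi = (0, 1, 0)` always yields a generalization of the truth.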
For the prior of the distribution φ_s, we assume that it follows a Dirichlet distribution Dir(β) with a hyperparameter β, which is the conjugate prior of categorical distributions.
Let O_H be the set of objects which have an ancestor-descendant relationship in their candidate set. In practice, there may exist some objects whose candidate values do not have an ancestor-descendant relationship. In this case, the probability of the second case (i.e., φ_{s,2}) may be underestimated. Thus, if there is no ancestor-descendant relationship between the claimed values about an object o (i.e., o ∉ O_H), we assume that a source generates its claimed value with the following probability
Model for worker trustworthiness: Let v_o^w be the claimed value chosen by a worker w among the candidates in V_o for an object o. Similar to the model for source trustworthiness, we also assume three relationships between a claimed value v_o^w and the truth t_o: (1) v_o^w = t_o, (2) v_o^w ∈ Anc(t_o) and (3) otherwise. Each worker w has its trustworthiness distribution ψ_w = (ψ_{w,1}, ψ_{w,2}, ψ_{w,3}) where ψ_{w,i} is the probability that an answer of the worker w corresponds to the i-th relationship. We assume that the trustworthiness distribution is generated from a Dirichlet distribution Dir(γ) with a hyperparameter γ.
Since it is difficult for workers to know the correct answer for every object, a worker may refer to web sites to answer the question. In such a case, if a piece of misinformation is widespread across multiple sources, the worker is also likely to respond with the incorrect information. Similar to (Dong et al., 2012; Pasternack and Roth, 2013), we thus exploit the popularity of a value in Cases 2 and 3 to consider such dependency between sources and workers.
Case 1 (v_o^w = t_o): The worker provides the exact true value with a probability ψ_{w,1}.
Case 2 (v_o^w ∈ Anc(t_o)): The worker provides a generalized true value with a probability ψ_{w,2}. We assume that the claimed value v is selected according to the popularity pop(v), which is the proportion of the records whose claimed value is v out of the records with generalized values of t_o.
Case 3 (otherwise): The claimed value is selected from the wrong values according to the popularity pop(v).
By the above model, the probability of selecting an answer v for the truth t_o of an object o is formulated as

P(v_o^w = v | t_o) = ψ_{w,1} if v = t_o; ψ_{w,2} · pop(v) if v ∈ Anc(t_o); ψ_{w,3} · pop(v) otherwise.
Similar to the model for source trustworthiness, if there is no ancestor-descendant relationship in the candidate values of an object o, the probability of selecting a claimed value is defined analogously.
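The answer probability described above can be sketched as follows, with pop computed from the multiset of record values (the function and argument names are illustrative):

```python
from collections import Counter

def answer_prob(v, truth, anc_of_truth, record_values, psi):
    """P(worker answers v | truth) under the three-case worker model.
    psi = (p_exact, p_generalize, p_wrong); the popularity of v is its
    share among the records falling in the same case."""
    counts = Counter(record_values)
    if v == truth:                                   # Case 1
        return psi[0]
    if v in anc_of_truth:                            # Case 2
        gen_total = sum(counts[a] for a in anc_of_truth)
        return psi[1] * counts[v] / gen_total
    wrong_total = sum(c for u, c in counts.items()   # Case 3
                      if u != truth and u not in anc_of_truth)
    return psi[2] * counts[v] / wrong_total

records = ['NY', 'NY', 'Liberty Island', 'LA']
psi = (0.7, 0.2, 0.1)
```

With truth ‘Liberty Island’ and ancestor set {‘NY’}, the probabilities of answering ‘Liberty Island’, ‘NY’ and ‘LA’ are 0.7, 0.2 and 0.1, which sum to one as required.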
Model for truth: We introduce a probability distribution over the candidate answers to determine the truth, called the confidence distribution. Each object o has a confidence distribution μ_o where μ_{o,v} is the probability that the value v is the true answer for o. We also use a Dirichlet prior Dir(α) with a hyperparameter α for the confidence distribution.
Based on the above three models, the generative process of our model works as follows.
Generative process: Given a set of objects O, a set of sources S and a set of workers W, our proposed model assumes the following generative process for the set of records R and the set of answers A:
Draw φ_s ~ Dir(β) for each source s ∈ S
Draw ψ_w ~ Dir(γ) for each worker w ∈ W
For each object o ∈ O
Draw a true value t_o
For each source s ∈ S_o
Draw a value v_o^s following the source model above
For each worker w ∈ W_o
Draw a value v_o^w following the worker model above
3.2. Estimation of Model Parameters
We now develop an inference algorithm for the generative model. Let Θ be the set of all model parameters, consisting of the source trustworthiness distributions φ_s, the worker trustworthiness distributions ψ_w and the confidence distributions μ_o. We propose an EM algorithm to find the maximum a posteriori (MAP) estimate of the parameters in our model.
The maximum a posteriori (MAP) estimator: Recall that R is the set of records from the sources and A is the set of answers from the workers. For every object o, each source and each worker generates its claimed values independently. Then, the likelihood of R and A based on our generative model is
where the probability of generating a claimed value by a source or a worker becomes
Consequently, the MAP point estimator is obtained by maximizing the log-posterior as
where the objective function is
Note that although we assumed that each claimed value is generated independently according to its probability distribution defined in Eqs. (5) and (6), the dependencies between sources and workers are already considered through the popularity terms.
The EM algorithm: We introduce a random variable to represent the type of the relationship between the claimed value and the truth . It is defined as follows:
In the E-step, we compute the conditional distributions of the hidden variables under our current estimate of the parameters Θ. Using Bayes’ rule, we can update the conditional probabilities as shown in Figure 4, where Desc(v) is the set of descendants of v among the candidate values and the remaining candidate values are those which are neither descendants of the value v nor v itself.
In the M-step, we find the model parameters that maximize our objective function . We first add Lagrange multipliers to enforce the constraints of model parameters.
We obtain the following equations for updating the model parameters by taking the partial derivative of the Lagrangian with respect to each model parameter and setting it to zero:
where O_s and O_w are the sets of objects claimed by the source s and the worker w, respectively. We infer the truth of each object by choosing the value with the maximum confidence among the candidate values, i.e., the value v ∈ V_o maximizing μ_{o,v}.
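The E-step/M-step alternation can be illustrated with a stripped-down EM that ignores the hierarchy, so each source has a single accuracy and errors are uniform over the wrong candidates. This is a deliberate simplification of the model above, not the paper's update rules:

```python
def em_truth_inference(claims, candidates, iters=20):
    """Simplified EM without hierarchy. claims = {source: value}.
    Alternates confidences over candidate truths (E-step) with
    source accuracies (M-step)."""
    acc = {s: 0.8 for s in claims}  # initial source accuracy
    n = len(candidates)
    conf = {}
    for _ in range(iters):
        # E-step: confidence that each candidate is the truth
        conf = {}
        for v in candidates:
            p = 1.0
            for s, c in claims.items():
                p *= acc[s] if c == v else (1 - acc[s]) / (n - 1)
            conf[v] = p
        z = sum(conf.values())
        conf = {v: p / z for v, p in conf.items()}
        # M-step: accuracy = expected fraction of correct claims
        for s, c in claims.items():
            acc[s] = conf.get(c, 0.0)
    return conf, acc

claims = {'s1': 'Liberty Island', 's2': 'Liberty Island', 's3': 'LA'}
conf, acc = em_truth_inference(claims, ['Liberty Island', 'LA', 'NY'])
```

After a few iterations the majority value dominates the confidence distribution and the dissenting source s3 receives a low estimated accuracy.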
Extension to numerical data: On the World Wide Web, numerical data also have an implicit hierarchy due to significant digits, which carry meaning contributing to the measurement resolution. For example, even though the area of Seoul is a single fixed quantity, different websites may represent it in various forms depending on the number of significant figures. An existing algorithm (Li et al., 2014a) for handling numerical data utilizes a weighted sum of the claimed values to account for their distribution. However, such a method is sensitive to outliers and thus needs proper preprocessing to remove them. To overcome these drawbacks, we generate the underlying hierarchy in the numerical data by assuming that a value v is a descendant of a value u if u can be obtained by rounding off v. Then, we can use our TDH algorithm to find the truths in numerical data by taking into account the relationships between the values in the implicit hierarchy. Our algorithm is also robust to outliers with extremely small or large values since we estimate the truth by selecting the most probable value from the candidate values rather than computing a weighted average of the claimed values.
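The rounding-based implicit hierarchy can be sketched as follows; the helper names and the range of significant digits tried are our assumptions:

```python
from math import floor, log10

def round_sig(x, digits):
    """Round x to `digits` significant figures."""
    if x == 0:
        return 0.0
    return round(x, digits - 1 - floor(log10(abs(x))))

def is_ancestor(u, v, max_digits=10):
    """u is an ancestor of v if v rounds to u at some precision,
    i.e., u is a less specific version of v."""
    return u != v and any(round_sig(v, d) == u for d in range(1, max_digits))

# e.g. 123.456 generalizes to 123.0 (3 sig. figs) and 120.0 (2 sig. figs)
```

A value is thus never its own ancestor, and arbitrary nearby values (e.g. 124.0 for 123.456) are unrelated rather than generalizations.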
4. Task Assignment to Workers
In this section, we propose a task assignment method to select the best objects to be assigned to the workers in crowdsourcing systems. We first define a quality measure of tasks called Expected Accuracy Increase (EAI) and develop an incremental EM algorithm to quickly estimate it. Finally, we present an efficient algorithm for assigning questions to each worker in a set of workers based on the measure.
4.1. The Quality Measure
Given a worker w, our goal is to choose the object to be assigned to the worker which is likely to increase the accuracy of the estimated truths the most. Thus, we define a quality measure for a pair of a worker and an object based on the improvement of the accuracy. As discussed in (Zheng et al., 2015), the improvement of the accuracy by a task can be estimated by using the difference between the highest confidences as follows:
where is the estimated confidence on if the worker answers about an object .
The quality measure used by QASCA: The QASCA (Zheng et al., 2015) algorithm calculates the estimated confidence by using the current confidence distribution and the likelihood of the answer given the truth as
where the answer is a sampled claimed value. There are two drawbacks in the quality measure of QASCA. First, since it computes the estimated confidence based on a sampled answer, the value of the quality measure is very sensitive to the sample. In addition, QASCA does not consider the number of claimed values collected so far, and thus the estimated confidence may be inaccurate. For instance, assume that there exist two objects which have identical confidence distributions. If one of the objects already has many collected claimed values, an additional answer is not likely to change its confidence significantly. Thus, task assignment algorithms should select the other object, which has a smaller number of collected records and answers.
Our quality measure: To avoid the sensitivity caused by sampling answers, we develop a new quality measure, Expected Accuracy Increase (EAI), which is obtained by taking the expectation of Eq. (13). That is,
By the definition of expectation, becomes
where the inner term is the conditional confidence when a worker w answers with a value v about the object o.
Since the answer probability can be computed by Eq. (6), to compute EAI by Eq. (15), we need an estimate of the conditional confidence given an additional answer. Recall that the estimated confidence computed by QASCA may not be accurate because it does not consider the records and answers collected so far. To reduce the error, we use them to compute the conditional confidence. We can compute the conditional confidence by applying the EM algorithm in Section 3.2 to the collected records and answers including the additional answer. However, since this is computationally expensive, we next develop an incremental EM algorithm.
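EAI can be sketched by combining the expectation over possible answers with a pseudo-count update of the confidences. This is a simplification standing in for the incremental EM step of the next subsection; the names are illustrative:

```python
def eai(conf, n_claims, answer_probs):
    """Expected accuracy increase for asking one more question.

    conf: current confidences {value: probability}.
    n_claims: number of records/answers collected so far (pseudo-count mass).
    answer_probs: {value: probability that the worker answers that value}.
    """
    before = max(conf.values())
    after = 0.0
    for v, p_ans in answer_probs.items():
        updated = {u: (conf[u] * n_claims + (u == v)) / (n_claims + 1)
                   for u in conf}
        after += p_ans * max(updated.values())
    return after - before

conf = {'a': 0.5, 'b': 0.5}
probs = {'a': 0.5, 'b': 0.5}
# The same uncertain object yields a larger EAI when few claims exist.
```

For identical confidence distributions, `eai(conf, 4, probs)` exceeds `eai(conf, 400, probs)`, capturing exactly the dependence on the number of collected claims that QASCA misses.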
4.2. The Incremental EM Algorithm
Let be the objective function in Eq. (7) after obtaining an additional answer . Then, we have
by adding the related term of the additional answer (its log-likelihood) to Eq. (8). Instead of running the iterative EM algorithm in Section 3.2, we speed up the computation by incrementally performing a single EM step for only the additional answer with the current model parameters and the above objective function.
E-step: Since we use the current model parameters, the probabilities of the hidden variables for collected records and answers are not changed. Thus, we only need to compute the conditional probabilities of the hidden variable given the additional answer as
based on the equation for used at the E-step in Figure 4.
M-step: For the objective function , we obtain the following equation of the M-step for the confidence distribution with the additional answer
by adding the related terms and to the numerator and the denominator of the update equation in Eq. (9), respectively. Let and be the numerator and the denominator in Eq. (9), respectively. Then, the above equation can be rewritten as
Since the numerator and the denominator are proportional to the number of existing claimed values, the confidence will change very little if there are already many claimed values. Thus, we can overcome the second drawback of QASCA. Since the numerators and denominators are repeatedly used to compute the conditional confidences, our truth inference algorithm keeps them in main memory to reduce the computation time.
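With the numerators and the shared denominator cached, the incremental confidence update can be sketched as follows (names and the scalar weight are illustrative):

```python
def incremental_update(numer, denom, v_answer, weight=1.0):
    """Update cached numerators and the shared denominator for one new
    answer instead of rerunning EM; returns (confidences, numer, denom).
    `weight` stands in for the E-step responsibility of the new answer."""
    numer = dict(numer)  # copy so callers can compare scenarios
    numer[v_answer] = numer.get(v_answer, 0.0) + weight
    denom += weight
    conf = {v: n / denom for v, n in numer.items()}
    return conf, numer, denom

conf, n2, d2 = incremental_update({'a': 8.0, 'b': 2.0}, 10.0, 'b')
```

With ten units of existing mass, one extra answer for ‘b’ only moves its confidence from 0.2 to 3/11 ≈ 0.27; with a thousand units it would barely move, matching the discussion above.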
Time complexity analysis: To calculate by Eq. (15), is computed times and is calculated for every pair of and (i.e., times). Moreover, computing and take time. Thus, it takes time to compute by Eq. (14). In reality, is very small compared to , and . In addition, by utilizing the pruning technique in the next section, we can significantly reduce the computation time. Therefore, the task assignment step can be performed within a short time compared to the truth inference. The execution time for each step will be presented in the experiment section.
4.3. The Task Assignment Algorithm
To find the objects to be assigned to each worker, we need to compute EAI for all pairs of a worker and an object. To reduce the number of EAI computations, we develop a pruning technique utilizing an upper bound of EAI.
An upper bound of EAI: We provide the following lemma which allows us to compute an upper bound of EAI.
Lemma 4.1 ().
(Upper Bound of Expected Accuracy Increase) For an object and a worker , we have