Our work is motivated by a real estimation problem associated with the ongoing conflict in Syria. While deaths are tremendously well documented, it is hard to know how many unique individuals have been killed by conflict-related violence in Syria. Since March 2011, increasing reports of deaths have appeared in both the national and international news. Reports from various media sources are often inconsistent, which is inherent in the data collection process and in the fact that reported victims are documented by multiple sources. Thus, our ultimate goal is to determine an accurate number of documented, identifiable deaths (with associated standard errors) because such information may contribute to future transitional justice and accountability measures. For instance, statistical estimates of death counts have been introduced as evidence in national court cases and international tribunals investigating the responsibility of state leaders for crimes against humanity (Grillo, 2016).
The main challenge with reliable death estimation of the Syrian data set is the fact that individuals who are documented as dead are often duplicated in the data sets. To address this challenge, one could employ entity resolution (de-duplication or record linkage), which refers to the task of removing duplicated records in noisy datasets that refer to the same entity (Tancredi and Liseo, 2011; Sadinle et al., 2014; Bhattacharya and Getoor, 2006; Gutman, Afendulis and Zaslavsky, 2013; McCallum and Wellner, 2004; Fellegi and Sunter, 1969). Entity resolution is fundamental in many large data processing applications. Informally, let us assume that each entity (record) is a vector in $\mathbb{R}^D$. Then, given a data set of records aggregated from many data sources with possibly numerous duplicated entities perturbed by noise, the task of entity resolution is to identify and remove the duplicate entities. For a review of entity resolution, see Winkler (2006), Christen (2012), and Liseo and Tancredi (2013).
One important subtask of entity resolution is estimating the number of unique entities (records) out of duplicated entities, which we call unique entity estimation. Entity resolution is a more difficult problem because it requires one to link each entity to its associated duplicate entities. To obtain high-accuracy entity resolution, the algorithms must at least evaluate a significant number of pairs for potential duplicates to ensure a link is not missed. Due to this (and to the best of our knowledge), accurate entity resolution algorithms scale quadratically or higher ($O(n^2)$ in the number of records $n$), making them computationally intractable for large data sets. Reducing the computational cost in entity resolution is known as blocking, which, via deterministic or probabilistic algorithms, places similar records into blocks or bins (Christen, 2012; Steorts et al., 2014). The computational efficiency comes at the cost of missed links and reduced accuracy for entity resolution. Further, it is not clear if we can use these crude but cheap entity resolution sub-routines for unbiased estimation of unique entities with strong statistical guarantees.
The primary focus of this paper is on developing a unique entity estimation algorithm that is motivated by the ongoing conflict in Syria and has the following desiderata:
The estimation cost should be significantly less than quadratic ($O(n^2)$ in the number of records $n$). In particular, any methodology requiring one to evaluate all pairs for linkage is not suitable. This is crucial for the Syrian data set and other large, noisy data sets (Section 1.3).
To ensure accountability regarding estimating the unique number of documented identifiable victims in the Syrian conflict, it is essential to understand the statistical properties of any proposed estimator. Such a requirement eliminates many heuristics and rule-based entity resolution tasks, where the estimates may be very far from the true value.
In most real entity resolution tasks, duplicated data can occur with arbitrarily large changes including missing information, which we observe in the Syrian data set, and standard modeling assumptions may not hold due to the noise inherent in the data. Due to this, we prefer not to make strong modeling assumptions regarding the data generation process.
1.1 Related Work for Unique Entity Estimation
The three aforementioned desiderata eliminate all but random sampling-based approaches. In this section, we review them briefly.
To our knowledge, only two random sampling based methodologies satisfy such requirements. Frank (1978) proposed sampling a large enough subgraph to estimate the total number of connected components based on the properties of the sub-sampled subgraph. Also, Chazelle, Rubinfeld and Trevisan (2005) proposed finding connected components with high probability by sampling random vertices and then visiting their associated components using breadth-first search (BFS). One major issue with random sampling is that most sampled pairs are unlikely to be matches (no edge), providing nearly no information, as the underlying graph is generally very sparse in practice. Randomly sampling vertices and running BFS, as required by Chazelle, Rubinfeld and Trevisan (2005), is very likely to return singleton vertices because many records are themselves unique in entity resolution data sets. In addition, finding all possible connections of a given vertex would require $O(n)$ edge queries, where $n$ is the number of records. A query for an edge corresponds to a query for the actual link between two records. Sub-sampling a sub-graph of size $s$, as in Frank (1978), requires $O(s^2)$ edge queries to completely observe it. Thus, $s$ should be reasonably small in order to scale. Unfortunately, requiring a small $s$ hurts the variance of the estimator. We show that the accuracy of both aforementioned methodologies is similar to the non-adaptive variant of our estimator, which has provably large variance. In addition, we show both theoretically and empirically that the methodologies based on random sampling lead to poor estimators.
While some methods have recently been proposed for accurate estimation of unique records, they belong to the Bayesian literature and have difficulty scaling due to the curse of dimensionality with Markov chain Monte Carlo (Steorts, Hall and Fienberg, 2016; Sadinle et al., 2014; Tancredi and Liseo, 2011). The evaluation of the likelihood itself is quadratic. Furthermore, they rely on a strong assumption about the specified generative models for the duplicate records. Given such computational challenges with the current state of the methods in the literature, we take a simple approach, especially given the large and constantly growing data sets that we seek to analyze. We focus on practical methodologies that can easily scale to large data sets with minimal assumptions. Specifically, we propose a unique entity estimation algorithm with sub-quadratic cost, which can be reduced to approximating the number of connected components in a graph with sub-quadratic queries for edges (Section 3.1).
The rest of the paper proceeds as follows. Section 1.2 provides our motivational application from the Syrian conflict and Section 1.3 remarks on the main challenges of the Syrian data set and our proposed methodology. Section 2 provides background on variants of locality sensitive hashing (LSH), which is essential to our proposed methodology. Section 3 provides our proposed methodology for unique entity estimation, which is the first formalism of using efficient adaptive LSH on edges to estimate the connected components with sub-quadratic computational time. (An example of our approach is given in Section 3.2.) More specifically, we draw connections between our methodology and random and adaptive sampling in Section 3.3, where we show under realistic assumptions that our estimator is theoretically unbiased and has provably low variance. In addition, in Section 3.5, we compare random and adaptive sampling for the Syrian data set, illustrating the strengths of adaptive sampling. In Section 3.6, we introduce the variant of LSH used in our paper. Section 3.7 provides our complete algorithm for unique entity estimation. Section 4 provides evaluations of all the related estimation methods on three real data sets from the music and food industries as well as official statistics. Section 5 reports the documented identifiable number of deaths in the Syrian conflict (with a standard error).
1.2 The Syrian Conflict
Thanks to the Human Rights Data Analysis Group (HRDAG), we have access to four databases from the Syrian conflict which cover roughly the same period, March 2011 – April 2014, namely, the Violation Documentation Centre (VDC), Syrian Center for Statistics and Research (CSR-SY), Syrian Network for Human Rights (SNHR), and Syria Shuhada website (SS). Each database lists a different number of recorded victims killed in the Syrian conflict, along with available identifying information including full Arabic name, date of death, death location, and gender. (Footnote 1: These databases include documented identifiable victims and not those who are missing in the conflict; hence, any estimate reported refers only to the data at hand.)
Since the above information is collected indirectly, such as through friends and religious leaders, or traditional media resources, it naturally comes with many challenges. The data set has biases, spelling errors, and missing values. In addition, it is well known that there are duplicate entities present in the data sets, making estimation more difficult. The ambiguities in Arabic names make the situation significantly worse as there can be a large textual difference between the full and short names in Arabic. (It is not surprising that the Syrian data set has such biases given that the data is collected in the midst of a conflict).
Such ambiguities and lack of additional information make entity resolution on this data set considerably challenging (Price et al., 2014). Owing to the significance of the problem, HRDAG has provided labels for a large subset of the data set. More specifically, five different human experts from HRDAG manually reviewed pairs of records in the four data sets, classifying them as matches if they referred to the same entity and non-matches otherwise. Our first goal is to accurately estimate the number of unique victims.
Obtaining a match or non-match label for a given record pair may incur substantial cost, such as manual human supervision or sophisticated machine learning. Given that producing hand-matched data is a costly process, our second goal is to provide a proxy, automated mechanism to create labeled data. (More information regarding the Syrian data set can be found in Appendix A.)
1.3 Challenges and Proposed Solutions
Consider evaluating the Syrian data set using all-to-all record comparisons to remove duplicate entities. With approximately 354,000 records from the Syrian data set, we have around 63 billion pairs ($6.3 \times 10^{10}$). Therefore, it is impractical to classify all these pairs as matches/non-matches reliably. We cannot expect a few experts (five in our case) to manually label 63 billion pairs. A simple computation of all pairwise similarities (63 billion) takes more than 8 days on a heavyweight machine that can run 56 threads in parallel (28 cores in total). In general, this quadratic computational cost is widely considered infeasible for large data sets. Algorithmic labeling of every pair, even if possible for relatively small datasets, is neither reliable nor efficient. Furthermore, it is hard to understand the statistical properties of algorithmic labeling of pairs. Such challenges, therefore, motivate us to focus on the estimation algorithm with the constraints mentioned in Section 1.
Our Contributions: We formalize unique entity estimation as approximating the number of connected components in a graph with sub-quadratic computational time. We then propose a generic methodology that provides an estimate in sample (with standard errors). Our proposal leverages locality sensitive hashing (LSH) in a novel way for the estimation process, with a required computational complexity that is less than quadratic. Our proposed estimator is unbiased and has provably low variance compared to random sampling based approaches. To the best of our knowledge, this is the first use of LSH for unique entity estimation in an entity resolution setting. Our unique entity estimation procedure is broadly applicable to many applications, and we illustrate this on three additional real, fully labeled, entity resolution data sets, which include the food industry, the music industry, and an application from official statistics. In the absence of ground truth information, we estimate that the number of documented identifiable deaths for the Syrian conflict is 191,874 reported casualties, with a standard deviation of 1,772, which is very close to the 2014 HRDAG estimate of 191,369. This demonstrates the power of our efficient estimator in practice, which does not rely on any strong modeling assumptions. Out of 63 billion possible pairs, our estimator only queries around 450,000 adaptively sampled pairs for labels, yielding a 99.99% reduction. The labeling was done using support vector machines (SVMs) trained on a small number of hand-matched, labeled examples provided by five domain experts. Our work is an example of the efforts required to solve a real, noisy, challenging problem where modeling assumptions may not hold.
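The pair counts quoted in this section can be checked with a few lines of arithmetic. The sketch below is only an illustration: it treats the approximate figures from the text (354,000 records, 450,000 queried pairs) as exact, and the variable names are ours.

```python
from math import comb

# Approximate figures quoted in the text, treated as exact for illustration.
n_records = 354_000
sampled_pairs = 450_000

# Number of unordered record pairs an all-to-all comparison would visit.
total_pairs = comb(n_records, 2)

# Share of all pairs that are actually queried for a label.
fraction_queried = sampled_pairs / total_pairs
reduction = 1.0 - fraction_queried

print(f"total pairs      : {total_pairs:,}")
print(f"fraction queried : {fraction_queried:.2e}")
print(f"reduction        : {reduction:.4%}")
```

The total is on the order of $6.3 \times 10^{10}$, which is where the "63 billion" figure comes from, and the reduction comfortably exceeds the 99.99% quoted above.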
2 Variants of Locality Sensitive Hashing (LSH)
In this section, we first provide a review of LSH and min-wise hashing, which is crucial to our proposed methodology. We then introduce a variant of LSH — Densified One Permutation Hashing (DOPH), which is essential to our proposed algorithm for unique entity estimation in terms of scalability.
In entity resolution tasks, each record can be represented as a string of information. For example, each record in the Syrian data set can be represented as a short text description of the person who died in the conflict. In this paper, we use a $k$-gram based shingle representation, which is the most common representation of text data and naturally gives a set of tokens (or $k$-grams). That is, each record is treated as a string and is replaced by a “bag” (or “multi-set”) of length-$k$ contiguous sub-strings that it contains. Since we will use a $k$-gram based approach to transform the records, our representation of each record will also be a set, which consists of all the $k$-contiguous characters occurring in the record string. As an illustration, for the record BAKER, TED, we separate it into a 2-gram representation. The resulting set (dropping the comma and the space) is the following:
$$\{\text{BA}, \text{AK}, \text{KE}, \text{ER}, \text{RT}, \text{TE}, \text{ED}\}.$$
In another example, consider Sammy, Smith, whose 2-gram set representation is
$$\{\text{SA}, \text{AM}, \text{MM}, \text{MY}, \text{YS}, \text{SM}, \text{MI}, \text{IT}, \text{TH}\}.$$
We now have two records that have been transformed into a 2-gram representation. Thus, for every record (string) we obtain a set $S \subseteq U$, where the universe $U$ is the set of all possible $k$-contiguous characters.
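The $k$-gram transformation is easy to make concrete. The following is a minimal sketch rather than the authors' preprocessing: the function name `shingles` is ours, and we assume that punctuation and whitespace are dropped and the text is upper-cased before shingling.

```python
def shingles(record: str, k: int = 2) -> set:
    """Return the set of k-grams (shingles) of a record string.

    Assumption: non-alphanumeric characters (commas, spaces) are removed and
    the text is upper-cased first; the actual preprocessing may differ.
    """
    cleaned = "".join(ch for ch in record.upper() if ch.isalnum())
    # Slide a window of width k over the cleaned string.
    return {cleaned[i:i + k] for i in range(len(cleaned) - k + 1)}


baker = shingles("BAKER, TED")
sammy = shingles("Sammy, Smith")
print(sorted(baker))
print(sorted(sammy))
```

Because the output is a set, repeated $k$-grams within a record collapse to a single token, which is what the Jaccard-based hashing in the following sections operates on.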
2.2 Locality Sensitive Hashing
LSH—a two-decade old probabilistic technique and method for dimension reduction—comes with sound mathematical formalism and guarantees. LSH is widely used in computer science and database engineering as a way of rapidly finding approximate nearest neighbors (Indyk and Motwani, 1998; Gionis et al., 1999). Specifically, the variant of LSH that we utilize is scalable to large databases, and allows for similarity based sampling of entities in less than a quadratic amount of time.
In LSH, a hash function is defined as $y = h(x)$, where $y$ is the hash code and $h(\cdot)$ the hash function. A hash table is a data structure that is composed of buckets (not to be confused with blocks), each of which is indexed by a hash code. Each reference item $x$ is placed into a bucket $h(x)$.
More precisely, LSH is a family of functions that map vectors to a discrete set, namely, $h: \mathbb{R}^D \to \{1, 2, \ldots, M\}$, where $M$ is in finite range. Given this family of functions, similar points (entities) are more likely to have the same hash value than dissimilar points (entities). The notion of similarity is specified by comparing two vectors of points (entities), $x$ and $y$. We will denote a general notion of similarity by $\mathrm{SIM}(x, y)$. In this paper, we only require a relaxed version of LSH, which we define below.
Definition (Locality Sensitive Hashing (LSH)). Let $x_1, x_2, y_1, y_2 \in \mathbb{R}^D$ and suppose $h$ is chosen uniformly from a family $\mathcal{H}$. Given a similarity metric $\mathrm{SIM}(x, y)$, $\mathcal{H}$ is locality sensitive if $\mathrm{SIM}(x_1, x_2) \ge \mathrm{SIM}(y_1, y_2)$ implies $\Pr_{\mathcal{H}}(h(x_1) = h(x_2)) \ge \Pr_{\mathcal{H}}(h(y_1) = h(y_2))$, where $\Pr_{\mathcal{H}}$ is the probability over the uniform sampling of $h$.
The above definition is a sufficient condition for a family of functions to be LSH. While many popular LSH families satisfy the aforementioned property, we only require this condition for the work described herein. For a complete review of LSH, we refer to Rajaraman and Ullman (2012).
One of the most popular forms of LSH is minhashing (Broder, 1997), which has two key properties — a type of similarity and a type of dimension reduction. The type of similarity used is the Jaccard similarity and the type of dimension reduction is known as the minwise hash, which we now define.
Let $\{0,1\}^D$ denote the set of all binary $D$-dimensional vectors, while $\mathbb{R}^D$ refers to the set of all $D$-dimensional vectors (of records). Note that records can be represented as a binary vector (or set) via shingling, bag-of-words (BoW), or a combination of the two. More specifically, given two record sets (or equivalently binary vectors) $x, y \in \{0,1\}^D$, the Jaccard similarity between $x$ and $y$ is
$$J(x, y) = \frac{|x \cap y|}{|x \cup y|},$$
where $|\cdot|$ is the cardinality of the set.
More specifically, the minwise hashing family applies a random permutation $\pi$ on the given set $x$ and stores only the minimum value after the permutation mapping, known as the minhash. Formally, the minhash is defined as $h_\pi(x) = \min(\pi(x))$, where $h_\pi(\cdot)$ is a hash function.
Given two sets $x$ and $y$, it can be shown by an elementary probability argument that
$$\Pr_\pi\big(h_\pi(x) = h_\pi(y)\big) = \frac{|x \cap y|}{|x \cup y|} = J(x, y), \tag{1}$$
where the probability is over uniform sampling of $\pi$. It follows from Equation 1 that minhashing is an LSH family for the Jaccard similarity.
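The collision property of minhashing can be checked empirically. The sketch below is illustrative only: it draws explicit random permutations of a tiny, made-up universe (real implementations use hash functions instead), and the helper names `jaccard`, `minhash`, and `collision_rate` are ours.

```python
import random


def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |x & y| / |x | y| of two non-empty sets."""
    return len(x & y) / len(x | y)


def minhash(x: set, perm: dict) -> int:
    """Smallest rank assigned by the permutation to any element of x."""
    return min(perm[e] for e in x)


def collision_rate(x, y, universe, trials=20000, seed=0):
    """Fraction of random permutations under which x and y share a minhash."""
    rng = random.Random(seed)
    items = sorted(universe)
    hits = 0
    for _ in range(trials):
        ranks = list(range(len(items)))
        rng.shuffle(ranks)                 # a uniformly random permutation
        perm = dict(zip(items, ranks))
        hits += minhash(x, perm) == minhash(y, perm)
    return hits / trials


# Toy token sets (hypothetical, not taken from any data set in the paper).
x = {"BA", "AK", "KE", "ER"}
y = {"BA", "AK", "KE", "RY"}
universe = x | y | {"ZZ", "QQ"}

print("Jaccard similarity:", jaccard(x, y))
print("empirical collision rate:", collision_rate(x, y, universe))
```

With enough trials the empirical collision rate should settle near the Jaccard similarity of the two sets, which is the statement that the minhash collision probability equals $J(x, y)$.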
Remark: In this paper, we utilize a shingling based approach, and thus, our representation of each record is likely to be very sparse. Moreover, Shrivastava and Li (2014a) showed that minhashing based approaches are superior compared to random projection based approaches for very sparse datasets.
2.3.1 Densified One Permutation Hashing (DOPH)
For realistically sized entity resolution tasks, sampling based on LSH requires hundreds of hashes (Section 3.6.1). It is well known that computing several minwise hashes of data is a very costly operation (Li, Shrivastava and Konig, 2012). Fortunately, recent literature on DOPH has shown that it is possible to compute several hundreds or even thousands of hashes of the data vector in one pass with nearly identical statistical properties as minwise hashes (Shrivastava and Li, 2014b, c; Shrivastava, 2017). In this paper, we will use the most recent variant of DOPH, which is significantly faster in practice compared to minwise hashing (Shrivastava, 2017). Throughout the paper, our mention of minwise hashing will refer to the DOPH algorithm for computing minhashes, which we have just mentioned. The complete details can be found in the aforementioned papers.
3 Unique Entity Estimation
In this section, we provide notation used throughout the rest of the paper and provide an illustrative example. We then propose our estimator, which is unbiased and has provably low variance. In addition, random sampling is a special case of our procedure as explained in section 3.5. Finally, we present our unique entity estimation algorithm in section 3.3.
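The problem of unique entity estimation can be reduced to approximating the number of connected components in a corresponding graph. Given a data set with size $n$, we denote the records as
$$R = \{r_i : 1 \le i \le n\}.$$
Next, we define
$$Q(r_i, r_j) = \begin{cases} 1 & \text{if } r_i \text{ and } r_j \text{ refer to the same entity}, \\ 0 & \text{otherwise}. \end{cases}$$
Let us represent the data set by a graph $G = (V, E)$ with vertices $v_i \in V$, $i = 1, \ldots, n$. Let vertex $v_i$ correspond to record $r_i$ and vertex $v_j$ correspond to record $r_j$. Then let edge $e_{ij}$ represent the linkage between records $r_i$ and $r_j$ (or vertices $v_i$ and $v_j$). More specifically, we can represent this by the following relationship:
$$e_{ij} \in E \iff Q(r_i, r_j) = 1.$$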
3.2 Illustrative Example
In this section, we provide an illustrative example of how six records are mapped to a graph $G$. Consider record 3 (John) and record 5 (Johnathan), which correspond to the same entity (John Schaech). In $G$, there is an edge $e_{35}$ that connects these records, denoted by $v_3$ and $v_5$. Now consider records 2, 4, and 6, which all refer to the same entity (Nicholas Cage). In $G$, there are edges $e_{24}$, $e_{26}$, and $e_{46}$ that connect these records, denoted by $v_2$, $v_4$, and $v_6$. Observe that each connected component in $G$ is a unique entity and also a clique. Therefore, our task is reduced to estimating the number of connected components in $G$.
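When every edge is known, counting unique entities is just counting connected components. The sketch below does this with a small union-find structure for the six-record example. It is an illustration under one assumption: the text does not describe record 1, so we treat it as an unmatched singleton. The function name `count_components` is ours.

```python
def count_components(n_vertices: int, edges) -> int:
    """Count connected components of a graph on vertices 1..n_vertices."""
    parent = list(range(n_vertices + 1))   # index 0 is unused

    def find(v: int) -> int:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b        # merge the two components

    return len({find(v) for v in range(1, n_vertices + 1)})


# Edges from the example: records 3 and 5 match; records 2, 4, 6 match.
# Record 1 is assumed (hypothetically) to have no duplicate.
example_edges = [(3, 5), (2, 4), (2, 6), (4, 6)]
n_unique = count_components(6, example_edges)
print("unique entities:", n_unique)
```

The difficulty addressed in the rest of this section is that the full edge list is never available at scale, so the component count has to be estimated from a small sample of queried pairs.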
3.3 Proposed Unique Entity Estimator
In this section, we propose our unique entity estimator and provide assumptions that are necessary for our estimation procedure to be practical (scalable).
Since we do not observe the edges of $G$ (the linkage), inferring whether there is an edge between two nodes (or whether two records are linked) can be costly, i.e., $O(n^2)$ over all pairs. Hence, one is constrained to probe a small set $S \subset V \times V$ of pairs, with $|S| \ll O(n^2)$, and query whether they have edges. The aim is to use the information about $S$ to estimate the total number of connected components accurately. More precisely, given the partial graph $G' = (V, E')$, where $E' = E \cap S$, one wishes to estimate the number of connected components of $G$.
One key property of our estimation process is that we do not make any modeling assumptions about how duplicate records are generated, and it is not immediately clear how we can obtain unbiased estimation. For the sake of simplicity, we first assume the existence of an efficient (sub-quadratic) process that samples a small set (near-linear size) of edges $S$, such that every edge in the original graph $G$ has a (reasonably high) probability $p$ of being in $S$. Thus, the set $S$, even though small, contains a fraction $p$ of the actual edges. For sparse graphs, as in the case of duplicate records, such a sampler will be far more efficient than random sampling. Based on this assumption, we will first describe our estimator and its properties. We then show why our assumption about the existence of an adaptive sampler is practical by providing an efficient sampling process based on LSH (Section 3.6).
Remark: It is not difficult to see that random sampling is a special case when $p = |S| / \binom{n}{2}$, which, as we show later, is far too small for any accurate estimation.
Our proposed estimator and corresponding algorithm obtain the set of vertex pairs (or edges) $S$ through an efficient (adaptive) sampling process and query whether there is an edge (linkage) between each pair in $S$. After the ground truth querying, we observe a sub-sampled graph $G'$, consisting of vertices returned by the sampler. Let $n'_i$ be the number of connected components of size $i$ in the observed graph $G'$, i.e., $n'_1$ is the number of singleton vertices, $n'_2$ is the number of isolated edges, etc., in $G'$. It is worth noting that every connected component in $G'$ is a part of some clique (maybe larger) in $G$. Let $n^*_i$ denote the number of connected components (cliques) of size $i$ in the original (unobserved) graph $G$.
Observe that under the sampling process, any original connected component, say $C^*$ (clique), will be sub-sampled and can appear as some possibly smaller connected component in $G'$. For example, a singleton set in $G$ will remain the same in $G'$. An isolated edge, on the other hand, can appear as an edge in $G'$ with probability $p$ and as two singleton vertices in $G'$ with probability $1 - p$. A triangle can decompose into three possibilities with the probabilities shown in Figure 2: it stays connected with probability $p^2(3 - 2p)$, becomes an edge plus a singleton with probability $3p(1-p)^2$, and becomes three singletons with probability $(1-p)^3$. Each of these possibilities provides a linear equation connecting $n^*_i$ to $n'_i$. These equations up to cliques of size three are
$$\mathbb{E}[n'_3] = n^*_3 \, p^2 (3 - 2p), \tag{2}$$
$$\mathbb{E}[n'_2] = n^*_2 \, p + n^*_3 \cdot 3p(1-p)^2, \tag{3}$$
$$\mathbb{E}[n'_1] = n^*_1 + n^*_2 \cdot 2(1-p) + n^*_3 \cdot 3(1-p)^2. \tag{4}$$
Since we observe $n'_i$, we can solve for the estimator $\hat{n}_i$ of each $n^*_i$ and compute the number of connected components by summing up all $\hat{n}_i$.
Unfortunately, this process quickly becomes combinatorial, and in fact, is at least #P hard (Provan and Ball, 1983) to compute for cliques of larger sizes. A large clique of size $k$ can appear as many separate connected components, and the number of ways it can break into smaller components is exponential (Aleksandrov, 1956). Fortunately, we can safely ignore large connected components without significant loss in estimation for two reasons. First, in practical entity resolution tasks, when the data set is large and contains at least one string-valued feature, it is observed that most entities are replicated no more than three or four times. Second, a large clique can only induce large errors if it is broken into many connected components due to undersampling. According to Erdos and Rényi (1960), it will almost surely stay connected if $p$ is high, which is the case with our sampling method.
Assumption: As argued above, we safely assume that the cliques of sizes equal to or larger than 4 in the original graph retain their structures, i.e., $n^*_i = n'_i$ for all $i \ge 4$. With this assumption, we can write down the formulas for estimating $n^*_3$, $n^*_2$, and $n^*_1$ by solving Equations 2–4 as
$$\hat{n}_3 = \frac{n'_3}{p^2(3 - 2p)}, \tag{5}$$
$$\hat{n}_2 = \frac{n'_2 - 3p(1-p)^2 \, \hat{n}_3}{p}, \tag{6}$$
$$\hat{n}_1 = n'_1 - 2(1-p)\,\hat{n}_2 - 3(1-p)^2\,\hat{n}_3. \tag{7}$$
It directly follows that our estimator, which we call the Locality Sensitive Hashing Estimator (LSHE), for the number of connected components is given by
$$\mathrm{LSHE} = n'_1 + n'_2 \cdot \frac{2p - 1}{p} + n'_3 \cdot \frac{1 - 6p(1-p)^2}{p^2(3 - 2p)} + \sum_{i \ge 4} n'_i. \tag{8}$$
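The estimator can be sketched in a few lines. The code below is our own illustration, not the authors' implementation: it assumes each true edge is sampled independently with probability `p`, so that a true triangle stays connected with probability $p^2(3-2p)$ and leaves exactly one surviving edge with probability $3p(1-p)^2$, and it assumes components of size four or more are observed intact. The function name `lshe` and the dictionary input format are hypothetical.

```python
def lshe(observed_counts: dict, p: float) -> float:
    """Estimate the number of connected components of the unobserved graph.

    observed_counts maps a component size i to the number of observed
    components of that size. p is the probability that a true edge is sampled.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    n1 = observed_counts.get(1, 0)
    n2 = observed_counts.get(2, 0)
    n3 = observed_counts.get(3, 0)
    # Components of size >= 4 are assumed to be observed without breaking.
    larger = sum(c for size, c in observed_counts.items() if size >= 4)

    # Invert the linear system, starting with the triangles.
    n3_hat = n3 / (p ** 2 * (3 - 2 * p))
    # Remove the isolated edges that are really broken triangles.
    n2_hat = (n2 - 3 * p * (1 - p) ** 2 * n3_hat) / p
    # Remove the singletons produced by broken edges and broken triangles.
    n1_hat = n1 - 2 * (1 - p) * n2_hat - 3 * (1 - p) ** 2 * n3_hat
    return n1_hat + n2_hat + n3_hat + larger


# Expected observed counts for 100 singletons, 40 pairs, 10 triplets at p = 0.8
# (made-up numbers used only to exercise the formula).
estimate = lshe({1: 117.2, 2: 32.96, 3: 8.96}, p=0.8)
print(round(estimate, 6))
```

Feeding in the expected observed counts recovers the true total of 150 components in this toy case, which is a quick sanity check that the inversion is consistent with the forward equations.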
3.4 Optimality Properties of LSHE
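We now prove two properties of our unique entity estimator, namely, that it is unbiased and that it has provably lower variance than random sampling approaches.
Theorem 1. Assuming $n^*_i = n'_i$ for all $i \ge 4$, we have
$$\mathbb{E}[\mathrm{LSHE}] = \sum_{i \ge 1} n^*_i,$$
$$\mathbb{V}[\mathrm{LSHE}] = n^*_3 \cdot \frac{(1-p)^2 (3p^2 - p + 1)}{p^2 (3 - 2p)} + n^*_2 \cdot \frac{1 - p}{p}. \tag{9}$$
That is, the above estimator is unbiased and the variance is given by Equation 9.
Theorem 2 shows that the variance of our estimator is monotonically decreasing with $p$.
Theorem 2. $\mathbb{V}[\mathrm{LSHE}]$ is monotonically decreasing when $p$ increases in the range $(0, 1]$.
Proof. The first-order derivative of $\mathbb{V}[\mathrm{LSHE}]$ with respect to $p$ is negative when $p \in (0, 1)$.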
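Unbiasedness and the variance expression can be sanity-checked by simulation. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's experiments: each true edge is kept independently with probability `p`, only cliques of size at most three are simulated, and the component counts (50, 30, 20) and `p = 0.8` are made up. All function names are ours.

```python
import random


def lshe(n1, n2, n3, p):
    """LSHE restricted to observed components of size 1, 2, and 3."""
    n3_hat = n3 / (p ** 2 * (3 - 2 * p))
    n2_hat = (n2 - 3 * p * (1 - p) ** 2 * n3_hat) / p
    n1_hat = n1 - 2 * (1 - p) * n2_hat - 3 * (1 - p) ** 2 * n3_hat
    return n1_hat + n2_hat + n3_hat


def simulate_once(n1_true, n2_true, n3_true, p, rng):
    """Sub-sample every true edge with probability p and apply LSHE."""
    n1, n2, n3 = n1_true, 0, 0
    for _ in range(n2_true):
        if rng.random() < p:
            n2 += 1              # the edge survives
        else:
            n1 += 2              # the pair splits into two singletons
    for _ in range(n3_true):
        kept = sum(rng.random() < p for _ in range(3))
        if kept >= 2:
            n3 += 1              # two or three edges keep the triangle connected
        elif kept == 1:
            n2 += 1              # one edge plus one singleton
            n1 += 1
        else:
            n1 += 3              # three singletons
    return lshe(n1, n2, n3, p)


def theoretical_variance(n2_true, n3_true, p):
    """Closed-form variance under independent edge sampling."""
    c3 = (1 - p) ** 2 * (3 * p ** 2 - p + 1) / (p ** 2 * (3 - 2 * p))
    c2 = (1 - p) / p
    return n3_true * c3 + n2_true * c2


rng = random.Random(0)
draws = [simulate_once(50, 30, 20, 0.8, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print("simulated mean:", mean)
print("simulated variance:", var, "theoretical:", theoretical_variance(30, 20, 0.8))
```

The simulated mean should sit close to the true total of 100 components, and the simulated variance close to the closed-form value, up to Monte Carlo error.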
3.5 Adaptive Sampling versus Random Sampling
Before we describe our adaptive sampler, we briefly quantify the advantages of adaptive sampling over random sampling for the Syrian data set by comparing their variances. Let $p$ be the probability that an edge (correct match) is sampled. On the Syrian data set, our proposed sampler, described in the next section, empirically achieves a high $p$ while reporting only around 450,000 sampled pairs out of the 63 billion possibilities. Substituting this value of $p$ into Equation 9 gives the corresponding variance.
Turning to plain random sampling of edges, achieving the same sample size leads to $p$ as low as roughly $7 \times 10^{-6}$ (450,000 out of 63 billion pairs). With such a minuscule $p$, the variance in Equation 9 is dominated by terms of order $1/p$ and $1/p^2$.
Thus, the variance for random sampling is roughly $1/p$ times the number of duplicates in the data set plus $1/(3p^2)$ times the number of triplets in the data set.
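The gap between the two regimes can be illustrated numerically. The sketch below evaluates the per-pair and per-triplet variance coefficients implied by independent edge sampling with probability `p`. The random-sampling probability is derived from the counts quoted in the text, whereas the adaptive value `p_adaptive = 0.8` is a purely hypothetical placeholder and not the value reported for the Syrian data set.

```python
def variance_coefficients(p: float):
    """Return (per-pair, per-triplet) variance coefficients for a given p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    per_pair = (1 - p) / p
    per_triplet = (1 - p) ** 2 * (3 * p ** 2 - p + 1) / (p ** 2 * (3 - 2 * p))
    return per_pair, per_triplet


# Random sampling: 450,000 queried pairs out of roughly 63 billion.
p_random = 450_000 / 63e9
# Hypothetical adaptive-sampling probability, for illustration only.
p_adaptive = 0.8

for label, p in [("random", p_random), ("adaptive (hypothetical)", p_adaptive)]:
    per_pair, per_triplet = variance_coefficients(p)
    print(f"{label:>24}: p={p:.2e}  pair={per_pair:.3g}  triplet={per_triplet:.3g}")
```

Under these assumptions the random-sampling coefficients are many orders of magnitude larger than the adaptive ones, which is the qualitative point of the comparison.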
In section 4, we illustrate that two other random sampling based algorithms of Chazelle, Rubinfeld and Trevisan (2005) and Frank (1978) also have poor accuracy compared to our proposed estimator. The poor performance of random sampling is not surprising from a theoretical perspective, and illustrates a major weakness empirically for the task of unique entity estimation with sparse graphs, where adaptive sampling is significantly advantageous.
3.6 The Missing Ingredient: (K,L)-LSH Algorithm
Our proposed methodology, for unique entity estimation, assumes that we have an efficient algorithm that adaptively samples a set of record pairs, in sub-quadratic time. In this section, we argue that using a variant of LSH (Section 2) we can construct such an efficient sampler.
As already noted, we do not make any modeling assumptions on the generation process of the duplicate records. Also, we cannot assume that there is a fixed similarity threshold, because in real datasets duplicates can have arbitrarily large similarity. Instead, we rely on the observation that record pairs with high similarity have a higher chance of being duplicate records. That is, we assume that when two entities $x$ and $y$ are similar in their attributes, it is more likely that they refer to the same entity (Christen, 2012). (Footnote 2: The similarity metric that we use to compare sets of record strings is the Jaccard similarity.) We note that this probabilistic observation is the weakest possible assumption, and is almost always true for entity resolution tasks because linking records by a similarity score is one simple way of approaching entity resolution (Christen, 2012; Winkler, 2006; Fellegi and Sunter, 1969).
The similarity between entities (records) naturally gives us a notion of adaptiveness. One simple adaptive approach is to sample record pairs with probability proportional to their similarity. However, as a prerequisite for such sampling, we must compute all the pairwise similarities and the associated probability value for every edge. Computing such pairwise similarity scores is a quadratic operation ($O(n^2)$ in the number of records $n$) and is intractable for large datasets. Fortunately, recent work (Spring and Shrivastava, 2017a; Spring and Shrivastava, 2017b; Luo and Shrivastava, 2017) has shown that it is possible to sample pairs adaptively in proportion to the similarity in provably sub-quadratic time using LSH, which we describe in the next section.
3.6.1 (K,L)-LSH Algorithm and Sub-quadratic Adaptive Sampling
We leverage a very recent observation associated with the traditional $(K, L)$-parameterized LSH algorithm. The $(K, L)$-parameterized LSH algorithm is a popular similarity search algorithm, which, given a query $q$, retrieves an element $x$ from a preprocessed data set in sub-linear time with probability $1 - (1 - J^K)^L$. Here, $J$ denotes the Jaccard similarity between the query $q$ and the retrieved data vector $x$. Our proposed method leverages this $(K, L)$-parameterized LSH algorithm, and we briefly describe the algorithm in this section. For complete details refer to Andoni and Indyk (2004).
Before we proceed, we define hash maps and keys. We use hash maps, where every integer (or key) is associated with a bucket (or a list) of records. In a hash map, searching for the bucket corresponding to a key is a constant time operation. Please refer to the algorithms literature (Rajaraman and Ullman, 2012) for details on hashing and its computational complexity. Our algorithm will require several hash maps, $L$ of them, where a record $x$ is associated with a unique bucket in every hash map. The key corresponding to this bucket is determined by $K$ minwise hashes of the record $x$. We encourage readers to refer to Andoni and Indyk (2004) for implementation details.
More precisely, let $h_{ij}$, $i \in \{1, 2, \ldots, L\}$ and $j \in \{1, 2, \ldots, K\}$, be $K \times L$ minwise hash functions (Equation 1), with each minwise hash function formed by independently choosing the underlying permutation $\pi$. Next, we construct $L$ meta-hash functions (or the keys) $H_i = (h_{i1}, h_{i2}, \ldots, h_{iK})$, where each of the $H_i$'s is formed by combining $K$ different minwise hash functions. For this variant of the algorithm, we need a total of $K \times L$ functions. With such meta-hash functions, the algorithm has two main phases, namely the data pre-processing and the sampling pairs phases, which we outline below.
Data Preprocessing Phase: We create $L$ different hash maps (or hash tables), where every hash value maps to a bucket of elements. For every record $x$ in the dataset, we insert $x$ in the bucket associated with the key $H_i(x)$ in hash map $i = 1, 2, \ldots, L$. To assign $K$-tuples (meta-hashes) to a number in a fixed range, we use some universal random mapping function to the desired address range. See Andoni and Indyk (2004); Wang, Shrivastava and Ryu (2017) for details.
Sample Pair Reporting: For every record $x$ in the dataset and from each table $i$, we obtain all the elements in the bucket associated with key $H_i(x)$, where $i = 1, 2, \ldots, L$. We then take the union of the $L$ buckets obtained from the $L$ hash tables, and denote this (aggregated) set by $A$. Finally, we report pairs of records $(x, y)$, where $y \in A$.
The $(K, L)$-LSH algorithm reports a pair $(x, y)$ with probability $1 - (1 - J^K)^L$, where $J$ is the Jaccard similarity between the record pair $(x, y)$.
Proof: Since all the minwise hashes are independent due to an independent sampling of permutations, the probability that both $x$ and $y$ belong to the same bucket in any given hash table is $J^K$: by Equation 1, each minwise hash agrees with probability $J$, so each meta-hash agreement has probability $J^K$. Therefore, the probability that the pair $(x, y)$ is missed by all $L$ tables is precisely $(1 - J^K)^L$, and thus the required probability of successful retrieval is the complement.
The probabilistic expression $1 - (1 - J^K)^L$ is a monotonic function of the underlying similarity $J$ associated with the LSH. In particular, higher similarity pairs have more chance of being retrieved. Thus, LSH provides the required sampling that is adaptive in similarity and is sub-quadratic in running time.
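The two phases described above can be sketched compactly. This is a minimal illustration under our own assumptions, not the authors' implementation: minwise hashes are approximated with salted BLAKE2b digests from Python's `hashlib` rather than explicit permutations or DOPH, records are assumed to be non-empty token sets keyed by an integer id, and the names `minhash`, `kl_lsh_pairs`, and `retrieval_probability` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash(tokens: set, seed: int) -> int:
    """Approximate one minwise hash by taking the minimum of a salted hash."""
    salt = seed.to_bytes(8, "little")
    return min(
        int.from_bytes(
            hashlib.blake2b(t.encode(), digest_size=8, salt=salt).digest(),
            "little",
        )
        for t in tokens
    )


def kl_lsh_pairs(records: dict, K: int, L: int) -> set:
    """Return the record-id pairs that share a bucket in at least one table."""
    # Preprocessing: one hash map per table, keyed by a K-tuple of minhashes.
    tables = [defaultdict(list) for _ in range(L)]
    for rid, tokens in records.items():
        for t in range(L):
            key = tuple(minhash(tokens, t * K + j) for j in range(K))
            tables[t][key].append(rid)
    # Reporting: every pair of records that collides in some bucket.
    pairs = set()
    for table in tables:
        for bucket in table.values():
            pairs.update(combinations(sorted(bucket), 2))
    return pairs


def retrieval_probability(J: float, K: int, L: int) -> float:
    """Probability that a pair with Jaccard similarity J is reported."""
    return 1 - (1 - J ** K) ** L


# Toy records (hypothetical token sets).
records = {1: {"BA", "AK", "KE"}, 2: {"BA", "AK", "KE"}, 3: {"XY", "YZ"}}
pairs = kl_lsh_pairs(records, K=2, L=4)
print(pairs)
print(retrieval_probability(0.5, K=2, L=4))
```

Larger `K` makes each table more selective, while larger `L` gives a pair more chances to collide, so the two parameters trade off the size of the sample against the chance of retrieving true duplicates.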
3.6.2 Computational Complexity
The computational cost of sampling is determined by the following steps. The procedure requires computing $K \times L$ minwise hashes for each record. This step is followed by adding every record to $L$ hash tables. Finally, for each record, we aggregate $L$ buckets to form sample pairs. The result of monotonicity and adaptivity of the samples applies to any value of $K$ and $L$. We choose these parameters such that we are able to get samples in sub-quadratic time. We further tune $K$ and $L$ using cross-validation to limit the size of our samples. In Section 5.3, we evaluate the effect of varying $K$ and $L$ in terms of the recall and reduction ratio. (For a review of the recall and reduction ratio, we refer to Christen (2012).) We address the precision at the very end of our experimental procedure to ensure that the recall, reduction ratio, and precision of our proposed unique entity estimation procedure are all as close to 1 as possible while ensuring that the entire algorithm is computationally efficient. For example, on the Syrian data set, we can generate 450,000 samples in roughly two minutes (127 seconds; see Section 5.1) with a high adaptive sampling probability (recall; see Section 5.3). On the other hand, computing all pairwise similarities (63 billion) takes more than 8 days on the same machine with 28 cores capable of running 56 threads in parallel.
Next, we describe how this LSH sampler is related to the adaptive sampler described earlier in Section 3.3.
3.6.3 Underlying Assumptions and Connections with $p$
Recall that we can efficiently sample record pairs with probability $1 - (1 - J^K)^L$. Since we are not making any modeling assumptions, we cannot directly link this probability to $p$, the probability of sampling the right duplicated pair (or linked entities) as required by our estimator LSHE. In the absence of any knowledge, we can estimate $p$ using a small set of labeled linked pairs. Specifically, we can estimate the value of $p$ by counting the fraction of matched pairs (true edges) from this labeled set that are reported by the sampling process.
Note that in practice there is no similarity threshold that guarantees that a pair of records are duplicates. That is, it is difficult in practice to know a fixed similarity threshold above which two records are guaranteed to be the same entity. However, the weakest possible and reasonable assumption is that high similarity pairs (textual similarity of records) should have higher chances of being duplicate records than lower similarity pairs.
Formally, this assumption implies that there exists a monotonic function of similarity such that the probability of any being a duplicate record is given by . Since our sampling probability is also a monotonic function of , we can also write
where is composed with which is the inverse of . Unfortunately, we do not know the form of or .
Instead of deriving (or ), which requires additional implicit assumptions on the form of the functions, our process estimates directly. In particular, the estimated value of is a data dependent mean-field approximation of , or rather,
Crucially, our estimation procedure does not require any modeling assumptions regarding the generation process of the duplicate records, which is significant for noisy data sets, where such assumptions typically break.
3.6.4 Why LSH?
Although there are several rule-based blocking methodologies, LSH is the only one that is also a random adaptive sampler. In particular, consider a rule-based blocking mechanism, for example on the Syrian data set, which might block on the date of death feature. Such blocking could be a very reasonable strategy for finding candidate pairs. Note that it is still very likely that duplicate records can have different dates of death because the information could be different or misrepresented. In addition, such a blocking method is deterministic, and different independent runs of the blocking algorithm will report the same set of pairs. Even if we find reasonable candidates, we cannot up-sample the linked records to get an unbiased estimate. There will be a systematic bias in the estimates, which does not have any reasonable correction. In fact, random sampling to our knowledge is the only known choice in the existing literature for an unbiased estimation procedure; however, as already mentioned, random uninformative sampling is likely to be very inaccurate.
LSH, on the other hand, can also be used as a blocking mechanism (Steorts et al., 2014). It is, however, more than just a blocking scheme; it is a provably adaptive sampler. Due to randomness in the blocking, different runs of the sampler lead to different candidates, unlike deterministic blocking. We can also average over multiple runs to further increase the concentration of our estimates. The adaptive sampling view of LSH has come to light very recently (Spring and Shrivastava, 2017a; Spring and Shrivastava, 2017b; Luo and Shrivastava, 2017). With adaptive sampling, we get much sharper unbiased estimators than the random sampling approach. To our knowledge, this is the first study of LSH sampling for unique entity estimation.
3.7 Putting it all Together: Scalable Unique Entity Estimation
We now describe our scalable unique entity estimation algorithm. As mentioned earlier, assume that we have a data set that contains a text representation of the records. Suppose that we have a reasonably sized, manually labeled training set . We will denote the set of sampled pairs of records given by our sampling process as . Note, each element of is a pair. Then our scalable entity resolution algorithm consists of three main steps, with the total computational complexity . In our case, we will always have and (in fact, will be a small constant), which ensures that the total cost is strictly sub-quadratic. The complete procedure is summarized in Algorithm 1.
Adaptively Sample Record Pairs (): We regard each record as a short string and replace it by an “n-grams” based representation. Then one computes $K \times L$ minwise hashes of each corresponding string. This can be done in a computationally efficient manner using the DOPH algorithm (Shrivastava, 2017), which is done in data reading time. Next, once these hashes are obtained, one applies the sampling algorithm described in Section 3 in order to generate a large enough set of sampled pairs. For each record, the sampling step requires exactly $L$ hash table queries, which are themselves memory lookups. Therefore, the computational complexity of this step is
Query each Sample Pair: Given the set of sampled pairs of records from Step 1, for every pair of records in the sample, we query whether the pair is a match or a non-match. This step requires one query for the true label per sampled pair. Here, one can use manually labeled data if it exists. In the absence of manually labeled data, we can also use a supervised algorithm, such as support vector machines or random forests, that is trained on the manually labeled set (Section 5).
Estimate $p$: Given the sampled set of record pairs, we need to know the value of $p$, the probability that any given correct pair is sampled. To do so, we use the labeled training set: the sampling probability can be estimated by computing the fraction of the matched pairs of training set records that appear among the sampled pairs. That is, we estimate $p$ (unbiasedly) by
$$\hat{p} = \frac{\#\{\text{matched training pairs that appear among the sampled pairs}\}}{\#\{\text{matched training pairs}\}}.$$
If is stored in a dictionary, then this step can be done on the fly while generating samples. It only costs extra work to create the dictionary.
Count Different Connected Components in (): The resulting matched sampled pairs, after querying every sample for actual (or inferred) labels, form the edges of . We now have complete information about our sampled graph . We can now traverse and count all sizes of connected components in to obtain , , and so on. Traversing the graph has computational complexity time using Breadth First Search (BFS).
Estimate the Number of Connected Components in (): Given the values of , , , and we use equation 7 to compute the unique entity estimator LSHE.
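The bookkeeping in the last three steps can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names and the toy pairs are hypothetical, records are assumed to be indexed by integers, and the final plug-in into Equation 7 is deliberately left out because that equation is not reproduced in this section.

```python
from collections import Counter, deque


def estimate_p(labeled_matches, sampled_pairs):
    """Fraction of known matching pairs that also appear among the sampled pairs."""
    def normalize(pair):
        return tuple(sorted(pair))  # treat (a, b) and (b, a) as the same pair

    sampled = {normalize(p) for p in sampled_pairs}
    labeled = {normalize(p) for p in labeled_matches}
    if not labeled:
        raise ValueError("need at least one labeled matching pair")
    return sum(1 for p in labeled if p in sampled) / len(labeled)


def component_size_counts(num_records, matched_pairs):
    """Map each component size to the number of connected components of that size (BFS)."""
    adjacency = {i: [] for i in range(num_records)}
    for a, b in matched_pairs:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen = set()
    counts = Counter()
    for start in range(num_records):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            vertex = queue.popleft()
            size += 1
            for neighbor in adjacency[vertex]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        counts[size] += 1
    return counts


# Toy example: 6 records, matched sampled pairs form one 2-component and one 3-component.
print(component_size_counts(6, [(0, 1), (2, 3), (3, 4)]))
print(estimate_p([(0, 1), (2, 3), (4, 5), (6, 7)], [(1, 0), (2, 3), (8, 9)]))
```

The resulting size counts and the estimate of $p$ are the quantities that the last step feeds into the estimator.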
We evaluate the effectiveness of our proposed methodology on the Syrian data set and three additional real data sets, where the Syrian data set is only partially labeled, while the other three data sets are fully labeled. We first perform evaluations and comparisons on the three fully labeled data sets, and then give an estimate of the documented number of identifiable victims for the Syrian data set.
Restaurant: The Restaurant data set contains 864 restaurant records collected from Fodor’s and Zagat’s restaurant guides (originally provided by Sheila Tejada, downloaded from http://www.cs.utexas.edu/users/ml/riddle/data.html). There are a total of 112 duplicate records. Attribute information contains name, address, city, and cuisine.
CD: The CD data set includes 9,763 CDs randomly extracted from freeDB (https://hpi.de/naumann/projects/repeatability/datasets/cd-datasets.html). There are a total of 299 duplicate records. Attribute information consists of 106 total features such as artist name, title, genre, among others.
Voter: The Voter data has been scraped and collected by Christen (2014) beginning in October 2011. We work with a subset of this data set containing 324,074 records. There are a total of 68,627 duplicate records. Attribute information contains personal information on voters from North Carolina including full name, age, gender, race, ethnicity, address, zip code, birth place, and phone number.
Syria: The Syria data set comprises data from the Syrian conflict, which covers the same time period, namely, March 2011 – April 2014. This data set is not publicly available and was provided by HRDAG. The respective data sets come from the Violation Documentation Centre (VDC), Syrian Center for Statistics and Research (CSR-SY), Syrian Network for Human Rights (SNHR), and Syria Shuhada website (SS). Each database lists a different number of recorded victims killed in the Syrian conflict, along with available identifying information including full Arabic name, date of death, death location, and gender. (These databases include documented identifiable victims and not those who are missing in the conflict. Hence, any estimate reported only refers to the data at hand.)
The above datasets cover a wide spectrum of different varieties observed in practice. For each data set, we report summary information in Table 1.
|DBname||Domain||Size||# Matching Pairs||# Attributes||# Entities|
4.1 Evaluation Settings
In this section, we outline our evaluation settings. We denote Algorithm 1 as the LSH Estimator (LSHE). We make comparisons to the non-adaptive variant of our estimator (PRSE), where we use plain random sampling (instead of adaptive sampling). This baseline uses the same procedure as our proposed LSHE, except that the sampling is done uniformly. A comparison with PRSE quantifies the advantages of the proposed adaptive sampling over random sampling. In addition, we implemented the two other known sampling methods, for connected component estimation, proposed in Frank (1978) and Chazelle, Rubinfeld and Trevisan (2005). For convenience, we denote them as Random Sub-Graph based Estimator (RSGE), and BFS on Random Vertex based Estimator (BFSE) respectively. Since the algorithms are based on sampling (adaptive or random), to ensure fairness, we fix a budget as the number of pairs of vertices considered by the algorithm. Note that any query for an edge is a part of the budget. If the fixed budget is exhausted, then we stop the sampling process and use the corresponding estimate, using all the information available.
We briefly describe the implementation details of the four considered estimators below:
LSHE: In our proposed algorithm, we use the ($K$, $L$) parameterized LSH algorithm to generate samples of record pairs using Algorithm 3, where, recall, $K$ and $L$ control the resulting sample size (Section 5.3). Given the sampled pairs as an input to Algorithm 1, we use the sample size as the value of the fixed budget. Table 2 gives different sample budget sizes (with the corresponding $K$ and $L$) and the corresponding values of $p$ for selected samples in three real data sets.
PRSE: For a fair comparison, in this algorithm, we randomly sample the same number of record pairs used by LSHE. We then perform the same estimation process as LSHE but instead use which corresponds to the random sampling probability to get the same number of samples, which is .
RSGE (Frank, 1978): This algorithm requires performing breadth first search (BFS) from each randomly selected vertex. BFS requires knowing all edges (neighbors) of a node for the next step, which requires edge queries. To respect the fixed budget, we end the traversal when the number of distinct edge queries reaches the budget.
BFSE (Chazelle, Rubinfeld and Trevisan, 2005): This algorithm samples a sub-graph and observes it completely. This requires labeling all the pairs of records in the sampled sub-graph. To respect the same budget, the size of the sampled sub-graph is chosen so that the number of pairs to label approximately matches the budget.
Remark: To the best of our knowledge there have been no experimental evaluations of the two algorithms of Frank (1978) and Chazelle, Rubinfeld and Trevisan (2005) in the literature. Hence, our results could be of independent interest in themselves.
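We compute the relative error (RE), calculated as
$$\mathrm{RE} = \frac{|\hat{n} - n|}{n},$$
where $\hat{n}$ denotes the estimate and $n$ the true number of unique entities, for each of the estimators, for different values of the budget. We plot the RE for each of the estimators, over a range of values of the budget, summarizing the results in Figure 5.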
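As a small worked example of the relative error, the snippet below assumes that RE is the absolute difference between an estimate and the reference count, divided by the reference count. The function name is hypothetical. The numbers are the two Syrian counts reported in Section 5.2, where the HRDAG count serves only as a reference value and not as ground truth.

```python
def relative_error(estimate, truth):
    """Absolute difference between estimate and truth, relative to the truth."""
    if truth <= 0:
        raise ValueError("truth must be positive")
    return abs(estimate - truth) / truth


# Estimate from Section 5.2 versus the HRDAG hand-matched count (reference only).
print(relative_error(191_874, 191_369))
```

On these numbers the two counts differ by 505 records, which is well under one percent of the reference count.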
All the estimators require querying pairs of records against labeled ground truth data to determine whether they are a match or a non-match. As already mentioned, in the absence of fully labeled ground truth data, we can use a supervised classifier such as an SVM as a proxy, assuming at least some small amount of labeled data exists. We use such a proxy in the Syrian data set because we are not able to query every pair of records to determine whether they are true duplicates or not.
We start with the three data sets where fully labelled ground truth data exists. For LSHE, we compute the estimation accuracy using both the supervised SVM (Section 5) as well as using the fully labelled ground truth data. The difference in these two numbers quantifies the loss in estimation accuracy due to the use of the proxy SVM prediction instead of using ground truth labeled data. In our use of SVMs, we take less than 0.01 of the total number of the possible record pairs as the training set.
4.2 Evaluation Results
In this section, we summarize our results regarding the aforementioned evaluation metrics by varying the sample size on the three real data sets (see Figure 5). We notice that for the CD and Voter data sets, we cannot obtain any reliable estimate (for any sample size) using PRSE. Recall that plain random sampling almost always samples pairs of records that correspond to non-matches. Thus, it is not surprising that this method is unreliable because sampling random pairs is unlikely to result in a duplicate pair for entity resolution tasks. Even with repeated trials, there are no edges in the specified sampled pairs of records, leading to an undefined value of . This phenomenon is a common problem in random sampling estimators over sparse graphs. Almost all the sampled nodes are singletons. Subsampling a small sub-graph leads to a graph with mostly singleton nodes, which leads to poor accuracy of BFSE. Thus, it is expected that random sampling will perform poorly. Unfortunately, there is no other baseline for unbiased estimation of the number of unique entities.
From Figure 5, observe that the RE for the proposed estimator LSHE is approximately one to two orders of magnitude lower than that of the other considered methods, where the y-axis is on the log-scale. Our proposed estimator LSHE consistently leads to significantly lower RE (lower error rates) than the other three estimators. This is not surprising given the analysis shown in Section 3.5. The variance of random sampling based methodologies will be significantly higher.
Taking a closer look at LSHE, we notice that we are able to efficiently generate samples with very high values of (see Table 2). In addition, we can clearly see that LSHE achieves high accuracy with very few samples. For example, for the CD data set, with a sample size less than of the total possible pairs of records of the entire data set, LSHE achieves RE. Similarly, for the Voter data set, with a sample size less than of the total possible pairs of records of the entire data set, LSHE achieves RE.
Also, note the small values of the $K$ and $L$ parameters required to achieve the corresponding sample size. $K$ and $L$ affect the running time, and small values indicate significant computational savings, as argued in Section 3.6.2.
As mentioned earlier, we also evaluate the effect of using SVM prediction as a proxy for actual labels with our LSHE. The dotted plot shows those results. We remark on the results for LSHE + SVM in section 5.
5 Documented Identifiable Deaths in the Syrian Conflict
In this section, we describe how we estimate the number of documented identifiable deaths for the Syrian data set. As noted before, we do not have ground truth labels for all record pairs, but the data set was partially labeled with 40,000 record pairs (out of 63 billion). We propose an alternative (automatic) method of labeling the sample pairs, which is also needed by our proposed estimation algorithm. More specifically, using the partially labeled pairs, we train an SVM. Other supervised methods could be considered here, such as random forests or Bayesian Additive Regression Trees (BART), among others; however, given that SVMs perform very well, we omit such comparisons, as we expect the results to be similar if not worse.
To train the SVM, we take every record pair and generate an n-gram representation for each record. Then we split the partially labeled data into training and testing sets, respectively. Each element of the training and testing sets is a pair of records together with a binary label indicating whether the record pair is a match or a non-match. That is, we represent each pair by the set difference of the n-grams of the strings of the two records, and the label records whether the pair is labeled as a match or not. Next, we tune the SVM hyper-parameters using 5-fold cross-validation, and we find that the accuracy of the SVM on the testing set is 99.9%. With a precision as high as 0.99, we can reliably query the SVM and treat its prediction as an expert label.
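The pair representation described above can be illustrated with a short sketch. It assumes character n-grams and interprets "set difference" as the symmetric difference of the two n-gram sets; both choices, along with the function names and the toy strings, are assumptions for illustration and may differ from the features actually used. The resulting set would still need to be encoded as a numeric vector before being passed to a classifier.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a string (the whole string if shorter than n)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def pair_features(record_a, record_b, n=3):
    """n-grams that appear in exactly one of the two records (symmetric difference)."""
    return char_ngrams(record_a, n) ^ char_ngrams(record_b, n)


# Identical records share all n-grams; near-duplicates differ in only a few.
print(pair_features("ahmad ali 2012-03-01", "ahmad ali 2012-03-01"))
print(sorted(pair_features("ahmad ali 2012-03-01", "ahmed ali 2012-03-01")))
```

Under this representation, an empty or very small feature set is the signal a classifier can exploit to predict a match.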
To understand the effect of using SVM prediction as a proxy to label queries in our proposed unique entity estimation algorithm, we return to the behavior observed in Figure 5. We treat the LSHE estimator on the other three real datasets as our baseline and compare it to LSHE with the SVM component, where the SVM prediction replaces the querying process (LSHE+SVM). Observe in Figure 5 that the plots for LSHE (solid black line) and LSHE+SVM (dotted black line) overlap, indicating a negligible loss in performance. This overlap is expected given the high accuracy (high precision) of the SVM classifier.
5.1 Running Time
We briefly highlight the speed of the sampling process since it could be used for on the fly or online unique entity estimation. The total running time for producing 450,000 sampled pairs (out of a possible 63 billion) with the LSH sampler (Section 3.6.1), for the chosen values of $K$ and $L$, is 127 seconds. On the other hand, it would take approximately 8 days to compute all pairwise similarities across the 354,996 Syrian records. Computing all pairwise similarities is the first prerequisite for any known adaptive sampling over pairs based on similarity, if we are not using the proposed LSH sampler. (Note: there are other ways of blocking (Christen, 2012); however, as mentioned in Section 3.6.4, they are mostly deterministic (or rule-based) and do not provide an estimator of the unique entities.)
5.2 Unique Number of Documented Identifiable Victims
In the Syrian dataset, with 354,996 records and possibly 63 billion () pairs, our motivating goal was to estimate the unique number of documented identifiable victims. Specifically, in our final estimate, we use 452,728 sampled pairs that are given by LSHE+SVM (, ) which has approximately based on the subset of labeled pairs. The sample size was chosen to balance the computational runtime and the value of . Specifically, one wants high values of (for a resulting low variance of our estimate) and, to balance running time, we limit the sample size to be around the total number of records , to ensure a near linear time algorithm. (Such settings are determined by the application, but as we have demonstrated they work for a variety of real entity resolution data sets). We chose the SVM as our classifier to label the matches and non-matches. The final unique number of documented identifiable victims in the Syrian data set was estimated to be 191,874, very close to the 191,369 documented identifiable deaths reported by HRDAG 2014, where their process is described in Appendix A.
5.3 Effects of $K$ and $L$ on sample size and $p$
In this section, we discuss the sensitivity of our proposed method as we vary the choice of $K$, $L$, the sample size, and $p$.
For the process to be truly sub-quadratic, both the sampling cost and the number of samples must remain well below quadratic in the number of records. For accuracy, we want high values of $p$, because the variance is monotonic in $p$, which is also the recall of true labeled pairs. Thus, there is a natural trade-off: if we sample more, we get a higher $p$ but incur more computation.
$K$ and $L$ are the basic parameters of our sampler (Section 3.6.1), which provide a tradeoff between the computational complexity and accuracy. A large value of $K$ makes the buckets sparse (exponentially), and thus fewer pairs of records are sampled from each table. A large value of $L$ increases the repetition of hash tables (linearly), which increases the sample size. As already argued in Section 3.6.2, the computational cost grows with both parameters.
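The opposing effects of the two parameters can be seen numerically. The sketch below assumes the standard (K, L) reporting probability $1 - (1 - J^K)^L$ for a pair with Jaccard similarity $J$; the function name and the parameter values are illustrative only and are not the settings used in the experiments.

```python
def report_probability(J, K, L):
    """Probability that a pair with Jaccard similarity J shares a bucket in at least one of L tables."""
    return 1.0 - (1.0 - J ** K) ** L


# Larger K suppresses low-similarity pairs; larger L raises the probability for all pairs.
for K, L in [(5, 10), (10, 10), (10, 50)]:
    row = ", ".join(
        f"J={J:.1f}: {report_probability(J, K, L):.3f}" for J in (0.3, 0.6, 0.9)
    )
    print(f"K={K:2d}, L={L:2d} -> {row}")
```

Increasing $K$ at fixed $L$ lowers the reporting probability at every similarity level, while increasing $L$ at fixed $K$ raises it, which is the trade-off that tuning has to balance.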
To understand how these quantities interact with the computational cost, we perform a set of experiments on the Syrian dataset. We use n-grams of size 2 to 5, vary $L$ from 5 to 100 in steps of 5, and let $K$ take the values 15, 18, 20, 23, 25, 28, 30, 32, 35. For all these combinations, we then plot the recall (also the value of $p$) and the reduction ratio (RR), which is the percentage of computational savings. A 99% reduction ratio means that only 1% of all possible pairs need to be examined. Figure 6 shows the tradeoffs between reduction ratio and recall (or value of $p$). Every dot in the figures is one whole experiment.
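For concreteness, the reduction ratio can be computed directly from the number of records and the number of sampled pairs. The sketch below assumes the usual definition, one minus the fraction of all possible pairs that are retained; the function names are hypothetical, and the inputs are the Syrian figures quoted in Section 5.2.

```python
from math import comb


def total_pairs(num_records):
    """Number of unordered record pairs."""
    return comb(num_records, 2)


def reduction_ratio(num_sampled_pairs, num_records):
    """Share of all possible pairs that never has to be examined."""
    return 1.0 - num_sampled_pairs / total_pairs(num_records)


n_records = 354_996   # Syrian records
n_sampled = 452_728   # sampled pairs used for the final estimate

print(total_pairs(n_records))  # 63010902510, i.e. about 63 billion
print(reduction_ratio(n_sampled, n_records))
```

With fewer than half a million pairs out of roughly 63 billion, the reduction ratio is extremely close to 1, which is why recall is the more informative axis when comparing parameter settings.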
Regardless of the n-gram variation from 2–5, the recall and reduction ratio (RR) are close to 1 as illustrated in Figure 6. We see that an n-gram of 3 overall is most stable in having a recall and RR close to 0.99. We observe that and gives a high recall of around 83% with less than half a million pairs (out of 63 billion possible) to evaluate ().
Motivated by three real entity resolution tasks and the ongoing Syrian conflict, we have proposed a general, scalable algorithm for unique entity estimation. Our proposed method is an adaptive LSH on the edges of a graph, which in turn estimates the connected components in sub-quadratic time. Our estimator is unbiased and has provably low variance in contrast to other such estimators for unique entity estimation in the literature. In experimental results, it outperforms other estimators in the literature on three real entity resolution data sets. Moreover, we have estimated the number of documented identifiable deaths to be 191,874, which very closely matches the 2014 HRDAG estimate, completed by hand-matching. To our knowledge, we have the first estimate for the number of documented identifiable deaths with a standard error associated with such an estimate. Our methods are scalable, potentially bringing impact to the human rights community, where such estimates could be updated in near real time. It could lead to further impact in public policy and transitional justice in Syria and other areas of conflict globally.
Acknowledgements: We would like to thank the Human Rights Data Analysis Group (HRDAG) and specifically, Megan Price, Patrick Ball, and Carmel Lee for commenting on our work and giving helpful suggestions that have improved the methodology and writing. We would also like to thank Stephen E. Fienberg and Lars Vilhuber for making this collaboration possible. PhD student Chen is supported by National Science Foundation (NSF) grant number 1652131. Shrivastava’s work is supported by NSF-1652131 and NSF-1718478. Steorts’s work is supported by NSF-1652431, NSF-1534412, and the Laboratory for Analytic Sciences (LAS). This work is representative of the author’s alone and not of the funding organizations.
Appendix A Syrian Data Set
In this section, we provide a more detailed description about the Syrian data set. As mentioned in section 1.2, via collaboration with the Human Rights Data Analysis Group (HRDAG), we have access to four databases. They come from the Violation Documentation Centre (VDC), Syrian Center for Statistics and Research (CSR-SY), Syrian Network for Human Rights (SNHR), and Syria Shuhada website (SS). Each database lists each victim killed in the Syrian conflict, along with identifying information about each person (see Price et al. (2013) for further details).
Data collection by these organizations is carried out in a variety of ways. Three of the groups (VDC, CSR-SY, and SNHR) have trusted networks on the ground in Syria. These networks collect as much information as possible about the victims. For example, information is collected through direct community contacts. Sometimes information comes from a victim’s friends or family members. Other times, information comes from religious leaders, hospitals, or morgue records. These networks also verify information collected via social and traditional media sources. The fourth source, SS, aggregates records from multiple other sources, including NGOs and social and traditional media sources (see http://syrianshuhada.com/ for information about specific sources).
These lists, despite being products of extremely careful, systematic data collection, are not probabilistic samples (Price, Gohdes and Ball, 2015; Price and Ball, 2015a, b; Price et al., 2014). Thus, these lists cannot be assumed to represent the underlying population of all victims of conflict violence. Records collected by each source are subject to biases, stemming from a number of potential causes, including a group’s relationship within a community, resource availability, and the current security situation. Although it is beyond the scope of this paper, final analyses of these sources must appropriately adjust for such biases before drawing conclusions about patterns of violence.
A.1 Syrian Handmatched Data Set
We describe how HRDAG’s training data on the Syrian data set, which we use in our paper, was created. We note that we only use a small fraction of the training data, for two reasons. The first is so that we can see how close our estimate is in practice to their original handmatched estimate, given that such a large portion of the data was handmatched. Second, we want to avoid using too much training data, both to avoid biases and because such handmatching efforts will not be feasible as the Syrian conflict continues; our small training data set is meant to reflect what would be available in practice going forward.
First, all documented deaths recorded by any of the documentation groups were concatenated together into a single list. From this list, records were broadly grouped according to governorate and year. In other words, all killings recorded in Homs in 2011 were examined as a group, looking for records with similar names and dates.
Next, several experts review these “blocks”, sometimes organized as pairs for comparison and other times organized as entire spreadsheets for review. These experts determine whether pairs or groups of records refer to the same individual victim or not. Pairs or groups of records determined to refer to the same individual are assigned to the same “match group.” All of the records contributing to a single “match group” are then combined into a single record. This new single record is then again examined as a pair or group with other records, in an iterative process.
For example, two records with the same name, date, and location may be identified as referring to the same individual, and combined into a single record. In a second review process, it may be found that that record also matches the name and location, but not date, of a third record. The third record may list a date one week later than the two initial records, but still be determined to refer to the same individual. In this second pass, information from this third record will also be included in the single combined record.
When records are combined, the most precise information available from each of the individual records is kept. If some records contain contradictory information (for example, if records A and B record the victim as age 19 and record C records age 20) the most frequently reported information is used (in this case, age 19). If the same number of records report each piece of contradictory information, a value from the contradictory set is randomly selected.
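The majority rule with random tie-breaking can be expressed compactly. The sketch below is only an illustration of that rule, with hypothetical function and field names: it treats missing values as `None` and does not model the notion of keeping the "most precise" information, which depends on expert judgment.

```python
import random
from collections import Counter


def merge_field(values, rng=random):
    """Most frequently reported non-missing value; ties are broken at random."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    counts = Counter(present)
    top = max(counts.values())
    candidates = [v for v, c in counts.items() if c == top]
    return candidates[0] if len(candidates) == 1 else rng.choice(candidates)


def merge_records(records, rng=random):
    """Combine the records of one match group field by field."""
    fields = {key for record in records for key in record}
    return {key: merge_field([r.get(key) for r in records], rng) for key in sorted(fields)}


# Records A and B report age 19, record C reports age 20: the merged record keeps 19.
group = [{"name": "A", "age": 19}, {"age": 19}, {"name": "A", "age": 20}]
print(merge_records(group))
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible.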
Three of the experts are native Arabic speakers; they review records with the original Arabic content. Two of the experts review records translated into English. These five experts review overlapping sets of records, meaning that some records are evaluated by two, three, four, or all five of the experts. This makes it possible to check the consistency of the reviewers, to ensure that they are each reaching comparable decisions regarding whether two (or more) records refer to the same individual or not.
After an initial round of clustering, subsets of these combined records were then re-examined to identify previously missed groups of records that refer to the same individual, particularly across years (e.g., records with dates of death 2011/12/31 and 2012/01/01 might refer to the same individual) and governorates (e.g., records with neighboring locations of death might refer to the same individual).
Appendix B Unique Entity Estimation Proofs
First, we introduce four indicators. First, let denote every 2-vertex clique in (recall that is the original graph and is the observed one):
Second, let denote every 3-vertex clique in :
Third, let denote every 3-vertex clique in :
Finally, let denote every 3-vertex clique in :
We now prove that our estimator is unbiased. Consider
We now turn to deriving the variance of our proposed estimator, showing that
Next, we replace by and by simplifying equation 17, we find
Note that the covariance of
When , since and are mutually exclusive, . Otherwise when , and are independent and . Similarly,
It then follows that
B.3 Variance Monotonicity
We now prove the monotonicity of the variance of our estimator.
is monotonically decreasing when increases in range .
First order derivative of is negative when .