Record linkage (de-duplication or entity resolution) involves identifying duplicate records in large, noisy databases. Traditional linkage methods that directly link records to one another become computationally infeasible as the number of records grows [3, 18], and thus it is increasingly common for researchers to treat linkage as a clustering task, in which latent entities are associated with one or more noisy database records, and the inferential goal is to identify the latent entity underlying each observed database record [14, 15, 16]. Although there are many probabilistic, generative models for clustering, several of which have been used for record linkage, their theoretical properties, such as performance bounds, have not been critically assessed.
The work of [14, 15, 16] attempted to deconstruct distorted data using latent variable mixture models. The authors achieved this by clustering similar records around a hypothesized latent entity, where their linkage structure keeps track of which observed records belong to the same latent entity. This is modeled through a latent variable mixture model with a distortion process on the data (Sections 2.1 and 2.2). Thus, the main goal is to take distorted data and uncover the underlying structure in the presence of noise. This is similar to signal processing, where a signal is received in the presence of noise and the goal is often to understand whether the underlying true (latent) signal can be recovered. We develop performance bounds under the framework proposed by [14, 15, 16].
We provide an upper bound on the Kullback-Leibler (KL) divergence between models with different linkage structures and use it to provide a lower bound on the minimum probability of misclassifying a latent entity. More precisely, under the categorical model of [15, 16] and string model of , we find the minimum probability of getting a latent entity incorrect. We make connections to Kolchin partition (KP) models , and we extend our KL bounds to this more general class. Finally, we explore how our bounds perform in practice and discuss their practical use.
1.1 Prior work
Bayesian methods and latent variable modeling have recently become popular in record linkage. A major advantage of Bayesian methods is their natural handling of uncertainty quantification for the resulting estimates. The first notion of a distortion process for record linkage is the hit-miss model, which uses a binary distortion process on the data. Within the Bayesian paradigm, most work has focused on specialized approaches related to linking two files [6, 17]. These contributions, while valuable, do not easily generalize to more than two files or to de-duplication within a single file. For a review of recent developments in Bayesian methods, see .
The work of [15, 16] recently introduced a Bayesian model that simultaneously handled record linkage and de-duplication for categorical data. Their approach allowed for natural uncertainty quantification during analysis and post-processing. Finally,  recently extended the work of  to both categorical and string-valued data using a coreference matrix or a partitioning approach. In the latter paper, it was shown that the coreference matrix is a special case of the linkage structure; thus, we work with the linkage structure. Another advantage of  and similar approaches is that their linkage structure is amenable to an efficient MCMC inference algorithm. These models have become practically relevant, as they have been shown to perform well on a variety of applications, including official statistics and medical data. In addition, extensions have been made to a more general framework of models [12, 17, 19], which is incorporated into our framework in Section 4.
Given this distortion process, it is natural to derive performance bounds on how well the underlying structure can be recovered. For example, much work has been done in information theory for subset selection in graphical model selection, signal de-noising, compressive sensing, and others.
In compressed sensing, one question recently addressed in  was how to directly measure only the part of the data from sounds and images that will not be thrown away. We make a connection here: in record linkage we wish to take noisy, distorted data and recover the underlying structure, and we quantify this using the KL divergence. Divergence functions [13, 7] are useful in many applications, including recent statistical applications of clustering, as done in  for hard clustering to obtain optimal quantization by minimizing the Bregman divergence (motivated by rate distortion theory).
The rest of this paper proceeds as follows. Section 2 reviews two recent record linkage models, given in Sections 2.1 and 2.2. Section 3 derives the respective performance bounds, while Section 4 extends our general result to a wider class of models. Section 5 shows the performance of the bounds in practice and discusses our findings and their practical use. Section 6 discusses future work.
2 Bayesian Record Linkage
We assume two Bayesian record linkage models, one dealing with categorical data and the other dealing with both categorical and noisy string data, such as names, addresses, etc. The first is that of [15, 16], and the second is that of .
2.1 Categorical Bayesian Record Linkage
We review notation common to both models. (For a toy example of the record linkage process, see the Supplementary Material.) Let represent the data, with databases, indexed by . The th list has observed records, indexed by . Each record corresponds to one of latent entities, indexed by Assume without loss of generality. Each record or latent entity has values on fields, indexed by , and the fields are assumed to be categorical and the same across all records and entities [15, 16]. denotes the number of possible categorical values for the th field.
In both models, denotes the observed value of the th field for the th record in the th list, and denotes the true value of the th field for the th latent entity. Then denotes the latent entity to which the th record in the th list corresponds, i.e., and represent the same entity if and only if . Then denotes the collectively. Distortion is denoted by , where denotes the indicator function. As usual, represents the indicator function (e.g., is 1 when the th field in record in file has the value ), and let denote the distribution of a point mass at (e.g., ). The model of [15, 16] is:
where MN denotes the Multinomial distribution and are all known. Guidance for the hyper-parameters and a justification of the (discrete) uniform prior are given in [15, 14, 16]. Model 2.1 assumes that different records are independent conditional on the deeper variables of the model. Moreover, it assumes the same conditional independence of different fields for the same record. Finally, observe that record linkage and de-duplication are both simply a question of whether , where for record linkage and for de-duplication.
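To make the generative structure of the categorical model concrete, the following minimal sketch simulates a single database. It assumes, in line with the description above, that an undistorted field copies the value of its latent entity (a point mass) and that a distorted field is redrawn from the field's categorical distribution. The function and argument names are hypothetical, and the field distribution is taken to be uniform purely for illustration.

```python
import random


def simulate_categorical_linkage(n_records, n_entities, n_fields, n_values,
                                 distortion_prob, seed=0):
    """Simulate one database from a hit-miss style categorical linkage model.

    All names are illustrative and are not the paper's notation.
    """
    rng = random.Random(seed)
    # Latent entity values: each field is drawn uniformly over n_values
    # categories (an assumption; any known field distribution could be used).
    latent = [[rng.randrange(n_values) for _ in range(n_fields)]
              for _ in range(n_entities)]
    # Linkage structure: discrete uniform prior over the latent entities.
    linkage = [rng.randrange(n_entities) for _ in range(n_records)]
    records, distorted = [], []
    for j in range(n_records):
        row, flags = [], []
        for f in range(n_fields):
            is_distorted = rng.random() < distortion_prob
            if is_distorted:
                # Distorted field: redraw from the field's distribution.
                value = rng.randrange(n_values)
            else:
                # Undistorted field: point mass at the latent value.
                value = latent[linkage[j]][f]
            row.append(value)
            flags.append(is_distorted)
        records.append(row)
        distorted.append(flags)
    return latent, linkage, records, distorted
```

With `distortion_prob=0` every record reproduces its latent entity exactly, so the linkage is as easy to recover as it can be; as the distortion probability approaches 1, the records carry less and less information about the linkage.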
2.2 Empirical Bayesian Record Linkage
The work of  assumes fields are string-valued, while fields are categorical, where is the total number of fields. They assume an empirical Bayesian distribution on the latent parameter. For each , let denote the set of all values for the th field that occur anywhere in the data, i.e., and let equal the empirical frequency of value in field Let denote the empirical distribution of the data in the th field from all records in all databases combined. So, if a random variable has distribution , then for every , . Hence, let be the prior for each latent entity . The distortion process changes such that
where is a fixed normalizing constant corresponding to an arbitrary distance metric . Denote this distribution by . The model becomes
where all distributions are also independent of each other; assume that are known. This framework was shown to work well in applications and simulation studies; however, it was quite sensitive to the choice of the hyperparameters. This method beat supervised methods, such as random forests, when the amount of training data input into the supervised methods was.
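The string distortion process can be illustrated with a short sketch. It assumes that the probability of observing a string is proportional to its empirical frequency times an exponential decay in its distance from the latent string, with the sum of these weights playing the role of the normalizing constant. The names `levenshtein`, `string_distortion_distribution`, and the steepness argument `c` are hypothetical, and the unit-cost Levenshtein distance below stands in for whatever distance metric is chosen.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Unit-cost Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_distortion_distribution(latent_value, observed_values, c):
    """Distribution over distorted strings for one latent string.

    Assumed form: P(w) is proportional to
    empirical_frequency(w) * exp(-c * distance(w, latent_value)).
    """
    counts = Counter(observed_values)
    total = sum(counts.values())
    weights = {w: (n / total) * math.exp(-c * levenshtein(w, latent_value))
               for w, n in counts.items()}
    normalizer = sum(weights.values())  # the normalizing constant
    return {w: weight / normalizer for w, weight in weights.items()}
```

Setting `c=0` recovers the empirical distribution of the field, while larger values concentrate the mass on strings close to the latent value.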
3 Performance Bounds of Record Linkage
Recall the connection to KL divergence in the sense that for any two distributions and , the maximum power for testing versus is Hence, a low value of means that we need many samples to distinguish from A natural question is how changing (latent entity) or (linkage structure) changes the distribution of (observed records). We search for both meaningful upper and lower bounds, since an upper bound will say that and are never more than so far apart, whereas a lower bound says how easy it is to tell and apart. Moreover, we investigate how well we can recover (latent entity) and (linkage structure) from (data).
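As a small illustration of the quantity being bounded, the sketch below computes the KL divergence between two discrete distributions on a common support. The function name and the example distributions are hypothetical; the point is only that a small divergence corresponds to distributions that are hard to tell apart from samples.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total
```

For example, `kl_divergence([0.5, 0.5], [0.9, 0.1])` is about 0.51 nats, whereas the divergence of any distribution from itself is zero.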
Assuming the conditions of [15, 14], let We know that are all independent given under both This implies that We first provide a theorem under the model of , which assumes categorical data and a hierarchical model. In Theorem 1, we find the minimum probability of getting a latent entity wrong. Moreover, we are able to say that with growing distortion of the data, there is no difference between two latent entities and the bound becomes infinite and non-informative in this case. Next, under the model of  we provide a general theorem, which assumes both categorical and noisy text data. This theorem provides an upper bound on the KL divergence of arbitrary distributions and .
3.1 Kullback-Leibler Divergence under Categorical Data
We use Fano’s inequality  to bound the probability of misclassification, as a function of the KL divergence between and , as defined in the previous section. Assume that and are two distinct linkage structures that correspond to the same latent entity Let be the cardinality of , i.e. .
This result finds an upper bound on the KL divergence and a lower bound for the probability that model 2.1 gets the linkage structure incorrect. Let
The KL divergence is bounded above by That is, .
The minimum probability of getting a latent entity wrong is
That is, as the latent entities become more distinct, increases. On the other hand, as the latent entities become more similar,
Consider Theorem 1 (i). Suppose Then If instead then The lower bound is only informative when We have more information when the latent entities are separated.
To show this, we simply apply Pinsker’s inequality, where for all : ∎
It follows from equation 3 that
It directly follows that
If , then
This proves (i). We now prove (ii). Using Fano’s inequality , the minimum probability of getting a latent entity wrong is where is the cardinality of , i.e. . As the latent entities become more distinct, increases. On the other hand, as the latent entities become more similar, ∎
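The two inequalities used in this argument can be checked numerically. The sketch below assumes the standard forms: Pinsker's inequality, which bounds the total variation distance by the square root of half the KL divergence, and the multiple-hypothesis version of Fano's inequality, in which the error probability is at least one minus (maximum pairwise KL plus log 2) divided by the log of the number of hypotheses. The function names are hypothetical, and the exact constants in the theorem may differ from this generic form.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def pinsker_upper_bound(p, q):
    """Pinsker's inequality: TV(p, q) <= sqrt(KL(p || q) / 2)."""
    return math.sqrt(kl_divergence(p, q) / 2.0)


def fano_lower_bound(max_kl, num_hypotheses):
    """Generic Fano-type lower bound on the error probability.

    Clipped at 0, where the bound is no longer informative.
    """
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    bound = 1.0 - (max_kl + math.log(2)) / math.log(num_hypotheses)
    return max(0.0, bound)
```

A larger KL divergence between hypotheses lowers the Fano bound, reflecting that well-separated models are easier to distinguish, while a larger number of hypotheses raises it.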
3.2 KL Divergence Bounds for String and Categorical Data
We now consider and under  for both categorical and noisy string data. Recall that tunes the amount of distortion as defined in equation 2.2. Recall that denotes any arbitrary distance metric between an observed string and a latent string as seen in equation 2.2, and is a fixed normalizing constant corresponding to the distance metric
In Theorem 4, for any distinct linkage structures, the minimum probability of getting a latent entity wrong is governed by a lower bound, which grows at a rate that is determined by the moment generating function of the distances between an observed string in the data and a latent string.
Assume data and distributions defined in section 3. Assume two distinct linkage structures, denoted by
There is an upper bound on the KL divergence between any given by that is
and is the cardinality of .
The proof of Theorem 4 can be found in the Supplementary Material.
4 Kolchin Partition Models
The models in sections 2.1 and 2.2 assume discrete uniform priors on the linkage structure. We extend this to a more general class of models from Bayesian nonparametrics known as KP models . Special cases include the work of [19, 17, 12]. We provide notation, examples, and then provide a general theorem.
The prior structure on can instead be viewed on the set of labelings. Specifically, let denote the partition of the observed records determined by , and denote the set containing all the possible partitions of the observed records. Then a distribution on the sample labels induces a distribution on That is, matches and duplicates are completely specified given the knowledge of , which is invariant with respect to the labelings of the partition blocks.
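The invariance of the partition to relabeling can be seen in a few lines of code. In this sketch, a linkage structure for a single database is represented as a list of entity labels, one per record; the helper name `labels_to_partition` is hypothetical.

```python
def labels_to_partition(labels):
    """Map a linkage labeling to the partition of record indices it induces."""
    blocks = {}
    for record_index, entity_label in enumerate(labels):
        blocks.setdefault(entity_label, set()).add(record_index)
    # A set of frozensets ignores both block order and block labels.
    return frozenset(frozenset(block) for block in blocks.values())
```

The labelings `[1, 1, 2, 3]` and `[7, 7, 4, 9]` induce the same partition, in which the first two records are duplicates, even though no label is shared between them.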
4.1 Special Cases of KP Models
We give some special cases of KP models that extend the class of models that we consider.
4.2 A Uniform Prior on the Label Space
Let denote the partition identified by and let denote the number of distinct entities considering all the observed records, i.e. the number of blocks of the partition. One has
different labelings which identify the same partition . Then
Note also that where is the Stirling number of the second kind, that is, the number of possible partitions of the records into non-empty sets, which implies
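The counting argument behind a uniform prior on labels can be verified numerically. The sketch below assumes that each record's label is drawn from a fixed pool of labels, so that the number of labelings inducing a given partition is a falling factorial in the number of blocks; the function names are hypothetical.

```python
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: partitions of n items into k non-empty blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def labelings_per_partition(num_labels, num_blocks):
    """Number of labelings, from num_labels available labels, that induce
    one fixed partition with num_blocks blocks (a falling factorial)."""
    if num_blocks > num_labels:
        return 0
    return factorial(num_labels) // factorial(num_labels - num_blocks)
```

Summing `stirling2(n, k) * labelings_per_partition(N, k)` over the number of blocks `k` recovers `N ** n`, the total number of labelings of `n` records with `N` labels, which confirms that every labeling is counted exactly once.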
4.3 The Uniform Prior on the Partition Space
Assuming one database,  focused mainly on the partitions of the records induced by and proposed a flat prior on the partition space, that is a prior which assigns equal probability to each different partition of the observed records. Assume that
where is the -th Bell number. In terms of partitions, the prior used by  can be written as and . This prior is also a special case of a KP model.
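For a flat prior on the partition space, the prior probability of any single partition is the reciprocal of a Bell number. The sketch below computes Bell numbers with the Bell triangle; the function names are hypothetical.

```python
def bell_number(n):
    """n-th Bell number (number of partitions of n items), via the Bell triangle."""
    row = [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def uniform_partition_prior(n_records):
    """Prior probability of any one partition under a flat prior on partitions."""
    return 1.0 / bell_number(n_records)
```

Because Bell numbers grow very quickly, the prior mass on any individual partition becomes tiny even for a modest number of records.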
We now provide a general theorem, which gives a relationship regarding priors (or partitions or blocks) of KP models to our KL divergence bounds. Suppose the prior on the linkage structure can be represented as a KP model . Then a wide class of priors is able to be considered and compared.
4.4 Microclustering and Record Linkage
There has been early work in Bayesian nonparametrics to push forward record linkage. The work of [9, 19] recently pointed out that most clustering tasks assume the cluster sizes grow linearly with the number of data points. Such examples include infinitely exchangeable clustering models, including finite mixture models, Dirichlet process (DP) mixture models, and Pitman–Yor process (PYP) mixture models. However, in record linkage, such an assumption is undesirable since linkage methods require models that yield clusters whose sizes grow sublinearly with the total number of data points (records). Due to this, the authors defined the microclustering property as well as a new model exhibiting such growth; their models outperformed or performed as well as the PYP and DP in terms of standard record linkage evaluation metrics on official statistics data, medical data, and human rights data. Furthermore, the authors proved that one of their models satisfies the microclustering property under very weak assumptions. We refer to their paper for further details.
One insight of this paper was the fact that their class of microclustering models considered can always be written as a KP model. Combining this clustering approach with the likelihood in equation 2.1 or 2.2 immediately allows one to perform record linkage inference. Furthermore, Theorems 1 and 4 are immediately satisfied since the prior on the linkage structure can be represented as a KP model.
5 Simulation Study and Discussion
We consider how the bounds in Sections 3.1 and 3.2 hold in two simulated experiments. In our experiments (Experiment I and Experiment II), synthetic categorical data are generated according to either model 2.1 or 2.2 using the parameters shown in Tables 1 and 2, respectively. In order to consider a realistic set of strings for , we consider the set of 20 most popular female baby names from 2014, according to the United States Census. Then for the distance , we consider the generalized Levenshtein edit distance.
We then generate both categorical and string records according to either model 2.1 or 2.2. For each experiment, we vary exactly one of the parameters to demonstrate its impact on the linkage error rate . We choose the other values such that the performance is neither extremely low nor extremely high. We set the distortion parameter to the same value for each , i.e. denotes a distortion probability of 0.6 for every field. 0.0 to 1.0 means we started with for all and swept the values until for all . Recall is the number of fields, and thus the maximum value of . We also set each to the same value, i.e. denotes for all and all . This further implies each field takes on exactly values in order for to be a valid probability distribution.
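Table 1: Parameter settings for Experiment I (categorical data).
| Figure | Number of records | Distortion probability | Number of fields | Probability of each field value |
| --- | --- | --- | --- | --- |
| Fig. 1(a) | 10 to 500 | 0.6 | 3 | 0.1 |
| Fig. 1(b) | 100 | 0 to 1 | 3 | 0.1 |
| Fig. 1(c) | 100 | 0.6 | 1 to 8 | 0.25 |

Table 2: Parameter settings for Experiment II (string data).
| Figure | Number of records | Distortion probability | Number of string fields | Steepness parameter |
| --- | --- | --- | --- | --- |
| Fig. 2(a) | 100 to 500 | 0.6 | 1 | 1.0 |
| Fig. 2(b) | 100 | 0.2 to 1 | 1 | 1.0 |
| Fig. 2(c) | 100 | 0.6 | 1 to 10 | 1.0 |
| Fig. 2(d) | 100 | 0.6 | 1 | 0 to 2 |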
We compare the bound in Theorem 1 to two record linkage algorithms [15, 16, 14]. The first is an exact sampler, which samples directly from . The second is a more realistic Gibbs sampler with empirically motivated priors proposed by . We run the Gibbs sampler for 10,000 iterations on all experiments to ensure proper mixing. There is some difficulty in comparing to , as there are multiple equally correct modes due to arbitrary re-orderings of the latent individuals and corresponding linkage structure . Even though the Gibbs sampler may infer the correct latent individuals and linkage structure, because the ordering is arbitrary, it is unlikely that . To avoid such an issue of label switching, we fix during the sampling process.
Specifically, we compare the bound to the empirical error rate of the Gibbs sampler proposed by . In order to compute the empirical probability , we hold fixed during Gibbs sampling to ensure errors in are not due to arbitrary changes in the ordering of the labels of . In addition, we compare the linkage error rate to an exact sampler, which samples directly from .
Results of Experiment I
In Figures 2 (a)-(d) we vary the number of records , distortion parameter , number of fields and number of values each field takes , respectively. The empirical results demonstrate that Theorem 1 captures the dependence between the error rate and all the relevant latent parameters , and . Specifically, linking records becomes more difficult as increases, the distortion parameter increases, the number of fields decreases or the number of values each field can take decreases. The bound nicely captures the logarithmic increase in error with respect to in Figure 2 (a), which gives hope for linking records in very large databases. Other terms appear to be when not near extreme error values, implying that low noise and a larger feature space are essential to performing high quality record linkage.
Results of Experiment II
Figures 3 (a)-(d) show that the bound in Theorem 4 is tight relative to the true performance on string data when varying , , number of string fields and , respectively. As expected, and similarly to the categorical results, linking records becomes more difficult as increases, the distortion parameter increases and the parameter decreases. The effects of parameter variation are less noticeable in the string experiments because linking string fields is easier than linking fields that have been anonymized, i.e., categorical fields.
The Gibbs sampler (blue diamonds) performs almost as well as the exact sampler (grey circles). In fact, due to the conditional entropy version of Fano’s inequality and the fact that , any Gibbs sampler cannot perform better in expectation than an exact sampler. Thus, we believe the gap between the bound (gold squares) and the exact sampler does not necessarily indicate the existence of a better algorithm, but perhaps only some unnecessary slack due to the application of Pinsker’s and then reverse Pinsker’s inequalities.
5.1 Discussion of Results
As illustrated in Theorems 1 and 4 we have derived an upper bound on the KL divergence as well as lower bounds for misclassifying a latent entity. In Theorem 1 (i), we showed that the latent entities become more distinct when is increasing. This is in contrast to when gets closer to 0, since then the latent entities become more similar. In Theorem 1 (ii), we showed that as the distortion parameter then the upper bound is infinite. In practice, as illustrated in , the latent entities are difficult to distinguish when the amount of distortion is more than 5% at every field value. Thus, this corresponds to when the bound is too loose. On the other hand, as the latent entities become more separated.
We discuss how separated the latent entities are under choices of and , providing guidance to the user in this setting given our simulation results. As practical guidance, when the distortion is between 0 and at every feature value, the latents will be more separated and the bound will be loose. On the other hand, as increases, the bound becomes tighter. The choice of can be made using subjective information about the underlying data and tuned using the hyper-parameters . (See [15, 14] for choosing such values.) On the other hand, we can see that for more realistic values of the distortion parameter in Figure 2 (a), (b), and (d), the bound is quite loose when the distortion parameter is large. Thus, a loose bound here is warranted due to the amount of noise or model-misspecification being placed into the model as well as the fact that all of the fields being used are categorical. Such results match the intuition given in .
In Theorem 4 (ii), we derived a lower bound where the minimum probability of getting a latent entity wrong is controlled by which is determined by the moment generating function of the distances between an observed string and a latent string. This bound has the same type of form as the bound in Theorem 1; however, since we now have string-valued data, we see that the minimum probability of getting a latent entity wrong is dominated by the string-valued variables and, specifically, by the distance functions and the constants used. In comparison to , this completely matches up with the sensitivity that was seen to the choice of the distance functions as well as the choice of , as this will completely dominate the posterior and, hence, the ability to tell latent entities apart under this posterior.
In practice, the driving force of the tightness of the bound is , the steepness parameter of the string distribution in equation 2. As increases, it is less likely for a string-valued record’s features to be distorted to values that are far from that of their latent feature values. This is verified in Figure 3(c), where linkage error decreases as increases. The work of  gave practical choices for , which were [0,2]. Similarly, we can speak to the tightness of , which relies on the distortion parameter not being too small in practice, as verified in Figure 3(b). In terms of the bounds found in Theorems 1 and 4, the empirical Gibbs sampler has tight bounds in almost all situations, except when the number of features is large, is too small, or is too small (and similarly for ). This coincides with exactly what we would expect in practice from the real experiments of .
For all applications in both categorical and string data, we expect the bounds to be loose in practice (corresponding to easier record linkage) when the distortion parameter is small () and when the number of fields is large () or the number of values that each field can take, , increases (this will be application specific). Finally, the bounds should be tighter, corresponding to more difficult record linkage, as the total number of records N increases (see Figure 2). These parameter values match almost exactly with two real data experiments (corresponding ranges of parameters) as well as a simulation study from [15, 16].
6 Future Directions
First, we have derived general performance bounds for record linkage, making connections to KP models and other related Bayesian models. More specifically, we have drawn connections to a wide class of models from Bayesian record linkage. Second, our bound for the categorical Bayesian record linkage model is easily interpretable and matches the intuition of the generative model. Third, our bound for the categorical and noisy string model takes a similar form to that of the categorical model. We are also able to interpret this bound in a way that aligns with the interpretations of [14, 15, 16], as well as show the practicality of our bounds for the aforementioned papers. More specifically, our bounds are empirically loose for categorical data, which is not unexpected since there is little information available to match on. This contrasts with the empirically tight bounds for both categorical and noisy string data. As illustrated in our experiments, with just one string variable, our bounds become much tighter, and as the number of strings increases, the bound becomes tighter still when compared to exact and Gibbs sampling.
In addition, there has been early work in Bayesian nonparametrics to push forward record linkage. The work of  pointed out that most clustering tasks assume cluster sizes grow linearly with the size of the database. Such examples include infinitely exchangeable clustering models, including finite mixture models, Dirichlet process mixture models, and Pitman–Yor process mixture models, which all make this linear growth assumption. However, in record linkage such an assumption is undesirable since linkage methods require models that yield clusters whose sizes grow sublinearly with the total number of data points. This observation led the authors to define the microclustering property as well as a new model exhibiting such growth. Our work has been able to provide bounds for the aforementioned work since the prior considered is a KP model. In future work it would be helpful to try to draw connections between those proposed in  and [14, 15, 16] in order to generalize such bounds and provide tighter bounds using conditional entropy or other sophisticated bounding methods.
We thank David Choi, David Dunson, David Banks, and the reviewers for improving the ideas that led to publication of this paper. This work was supported in part by NSF grants SES-1534412 and SES-1131897, DARPA FA8750-12-2-0324 and FA8750-14-2-0244.
- [1] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705–1749, 2005.
- [2] Daniel Berend, Peter Harremoës, and Aryeh Kontorovich. Minimum KL-divergence on complements of balls. IEEE Transactions on Information Theory, 60(6):3172–3177, 2014.
- [3] P. Christen. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.
- [4] J. Copas and F.J. Hilton. Record linkage: Statistical models for matching computer records. Journal of the Royal Statistical Society, Series A, 153(3):287–320, 1990.
- [5] David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
- [6] R. Gutman, C. Afendulis, and A. Zaslavsky. A Bayesian procedure for file linking to analyze end-of-life medical costs. Journal of the American Statistical Association, 108(501):34–47, 2013.
- [7] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
- [8] B. Liseo and A. Tancredi. Some advances on Bayesian record linkage and inference for linked data. Technical Report, 2013.
- [9] Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, and Rebecca Steorts. The Microclustering Problem: When the Cluster Sizes Don’t Grow with the Number of Data Points. NIPS Bayesian Nonparametrics: The Next Generation Workshop Series, 2015.
- [10] Jim Pitman. Combinatorial Stochastic Processes: Ecole D’Eté de Probabilités de Saint-Flour XXXII-2002. Springer, 2006.
- [11] Vyacheslav Valer’evich Prelov and Edward C. van der Meulen. Mutual information, variation, and Fano’s inequality. Problems of Information Transmission, 44(3):185–197, 2008.
- [12] Mauricio Sadinle. Detecting duplicates in a homicide registry using a Bayesian partitioning approach. The Annals of Applied Statistics, 8(4):2404–2434, 2014.
- [13] Claude E Shannon. A note on the concept of entropy. Bell System Tech. J, 27:379–423, 1948.
- [14] R. C. Steorts. Entity resolution with empirically motivated priors. Bayesian Analysis, 10(4):849–875, 2015.
- [15] R. C. Steorts, R. Hall, and S. E. Fienberg. SMERED: A Bayesian approach to graphical record linkage and de-duplication. Journal of Machine Learning Research, 33:922–930, 2014.
- [16] R. C. Steorts, R. Hall, and S. E. Fienberg. A Bayesian approach to graphical record linkage and de-duplication. Journal of the American Statistical Association, In press.
- [17] A. Tancredi and B. Liseo. A hierarchical Bayesian approach to record linkage and population size problems. Annals of Applied Statistics, 5(2B):1553–1585, 2011.
- [18] W. E. Winkler. Overview of record linkage and current research directions. Technical report, U.S. Bureau of the Census Statistical Research Division, 2006.
- [19] Giacomo Zanella, Brenda Betancourt, Jeffrey W Miller, Hanna Wallach, Abbas Zaidi, and Rebecca Steorts. Flexible models for microclustering with application to entity resolution. In Advances in Neural Information Processing Systems, pages 1417–1425, 2016.
Appendix A Example of the Record Linkage Process
We provide a toy illustration of the general record linkage process in figure 4. Consider three databases and the notation already introduced, where here . Suppose the “population” entities have four members, where name and address are stripped for anonymity and they are listed by state, age, and sex, as is often the case with de-identified data.
For instance, assume the true latent entity vector is known:
The observed records are given in three separate databases (k=3), which would combine into a three-dimensional array. We write this here as three two-dimensional arrays for notational simplicity:
Here, for the sake of keeping the illustration simple, only age is distorted. Comparing to , the intended linkage and distortions are
In this linkage structure, every entry of with a value of 2 means that some record from refers to the latent entity with attributes “SC, 73, F.” Here, the age of this entity is distorted in all three databases, as can be seen from . (Note that , like , is also really a three-dimensional array.) Looking at and , we see that there is only a single record in either list that is distorted, and it is only distorted in one field. In list 2, however, every record is distorted, though only in one field.
Appendix B Derivation of Theorem 4
We briefly restate the theorem, and then provide its derivation.
Assume data and distributions Assume two distinct linkage structures, denoted by
There is an upper bound on the KL divergence between any given by that is
and is the cardinality of .
We first prove (i). Consider
Suppose that Equation 6 implies that
We now consider and by equation 7, we find
Then by equation 8, it is clear that
Now assume that two field attributes are different. That is, suppose there exists an Then we assume that there exists a such that By the reverse triangle inequality, for any
Equation B in turn implies that
is the moment generating function of (evaluated at c), where This implies that
Then by the reverse Pinsker’s inequality , we can write
Thus, (i) is established. Using Fano’s inequality, we find that
We have established that for any , the minimum probability of getting a latent entity wrong is governed by the constant That is, the lower bound grows as goes to and its rate of growth is determined by the moment generating function of the distances. We have now established (ii). ∎
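The role of the moment generating function can be illustrated empirically. The sketch below estimates it as the sample average of exp(c · d) over a collection of observed-to-latent string distances; the function name, the argument `c`, and the use of a plain sample average are assumptions made for illustration, not the exact quantity appearing in the bound.

```python
import math


def empirical_mgf(distances, c):
    """Sample average of exp(c * d) over a collection of distances."""
    if not distances:
        raise ValueError("need at least one distance")
    return sum(math.exp(c * d) for d in distances) / len(distances)
```

When all distances are zero the estimate equals 1 for every `c`, and for positive distances it increases with `c`, which is the sense in which the distances and the steepness constant together drive the rate of growth.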