Record linkage (de-duplication or entity resolution) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive or post-linkage task on the linked data. In this paper, we propose a joint model for the record linkage and the downstream task of linear regression. Our proposed model can link records over an arbitrary number of databases (lists or files). We assume there is duplication within each database, known as “duplicate detection.” Our record linkage model can be expressed as a random partition model, which leads to a large family of distributions. Next, we jointly model the record linkage task and the downstream task (linear regression), which allows for the exact propagation of the record linkage uncertainty into the downstream task. Crucially, this generates a feedback propagation mechanism from the proposed Bayesian record linkage model into the downstream task of linear regression. This feedback effect is essential to eliminate potential biases that can jeopardize resulting inference in the downstream task. We apply our methodology to multiple linear regression, and illustrate empirically that the “feedback effect” is able to improve performance of record linkage.
1.1 Prior Work
Our work builds off [18, 16, 17, 14], which all proposed Bayesian record linkage models well suited for categorical data.  modeled the fully observed records through the “hit-and-miss” measurement error model . One natural way to handle record linkage uncertainty is via a joint model of the record linkage and downstream task.  introduced a record linkage model for continuous data based on a multivariate normal model with measurement error. Turning to just record linkage tasks, [16, 17] were the first to perform simultaneous record linkage and de-duplication on multiple files by using the fully observed records, creating a scalable record linkage algorithm. In similar work, de-duplication in a single database framework was tackled from a Bayesian perspective in  by using the information provided by the comparison data.
Related work regarding the record linkage and downstream task has been considered under specific assumptions. 
assumed that the two databases represent a permutation of the same database of units and proposed an estimator (LL) of the regression coefficients which is unbiased, conditionally on the matching probabilities provided by the record linkage task.
extended this approach to handle more complex and realistic linkage scenarios and logistic regression problems. Generalizations of the LL estimator have been also provided by using estimating equations. In addition, 
proposed to consider the probabilities of being a match, provided by the record linkage algorithm, as an ingredient to be used within a multiple imputation scenario. Finally, proposed a Bayesian method that jointly models the record linkage and the association between the overlapping features in two different databases. The authors consider a somewhat simpler situation where the number of records to match in the two databases is relatively small, and rely upon specific blocking criteria. In addition, one potential limitation of the approach is the assumption of a specific matching pattern: for each single block of comparisons, all cases in the smaller database will certainly appear in the other database. We refer to  for details.
Section 2 introduces our Bayesian record linkage model, providing extensions to priors on random partitions. Section 3 generalizes our record linkage methodology to the downstream task of linear regression. Section 4 provides experiments for the record linkage task on synthetic data. We then provide three experiments on the joint record linkage and downstream task of linear regression on synthetic data. Section 5 provides a discussion and extensions to future work.
2 Bayesian Record Linkage and Priors on Partitions
In this section, we introduce notation used throughout the paper, our Bayesian record linkage model, and an alternative and more intuitive construction for the prior on co-referent records, known as the linkage structure.
Assume databases (lists, data sets, or files) that consist of either qualitative and/or categorical records, which are noisy due to the data collection process. Each record corresponds to an underlying latent entity (statistical unit) of partially overlapping samples (or populations). In addition, assume all databases have overlapping features (fields). Assume sets of records are collected from a given population of size where in the same framework as [17, 15]. As such, assign a label () to each member of the population. Next, let
) be the vector of the categorical overlapping features for the population individual . Finally, denote the entire set of population records by
2.2 Bayesian record linkage model
Assume the set of population records is generated independently, for
, from a vector of independent categorical variables such that and
where is the number of categorical values for the th feature. At the sample level, assume that one does not observe the “true” population values due to measurement errors. Thus, the observed records, which is a database of size , , consists of distorted versions of subsets of the vectors Let denote the observed values for the -th record of the -th database, where and . Denote the observed records (across the databases) by Next, let the set of latent indicator variables denote the unknown co-reference (matching) pattern between the observed records and the population records where indicates that the population record generated the observed record . (Footnote 1: The relation , with , implies that records and of the -th database are co-referent to the same population record. This is an instance of duplicate-detection within the same database. When , with , one has the usual record linkage framework with the same individual appearing in two different databases.) In general, let denote the linkage structure.
Next, we formalize the distortion mechanism when the population records are observed in the databases using the hit-and-miss model . Let
be the random variable that generates the observed record. Assume that , that is, has the same support as . Let if and if , which implies that
for where represents the distortion probability for the -th overlapping feature. Here, the true population value is observed with probability , and a different value is drawn from the random variable generating the population values with probability Finally, assuming the conditional independence among all the overlapping features given their respective unobserved population counterparts, one obtains
We assume that the distortion probabilities are exchangeable, that is
and we assume the probabilities are considered known and equal to the corresponding population frequencies. The model summarized by equations (1) and (3) can be viewed as a latent variable model where the unobserved population records generate the observed records and can be viewed as the unknown model distortion parameter.
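The distortion mechanism described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `distort_record` and `hit_miss_prob`, the list-of-lists layout for the population frequencies, and the example values are all assumptions introduced here.

```python
import random

def hit_miss_prob(observed, true, theta_l, beta_l):
    """Probability of observing `observed` for one feature whose true value is `true`.

    theta_l[v] is the (assumed known) population frequency of value v,
    beta_l is the distortion probability of this feature.
    """
    hit = 1.0 if observed == true else 0.0
    return (1.0 - beta_l) * hit + beta_l * theta_l[observed]

def distort_record(true_values, theta, beta, rng):
    """Return a noisy copy of one population record under a hit-and-miss mechanism."""
    observed = []
    for l, v in enumerate(true_values):
        if rng.random() < beta[l]:
            # "Miss": redraw from the population distribution of feature l.
            # The redraw may return the true value again by chance.
            categories = range(len(theta[l]))
            observed.append(rng.choices(categories, weights=theta[l], k=1)[0])
        else:
            # "Hit": the true population value is recorded unchanged.
            observed.append(v)
    return observed

# Hypothetical example: two features with 3 and 2 categories.
theta = [[0.5, 0.3, 0.2], [0.9, 0.1]]
rng = random.Random(0)
noisy = distort_record([2, 1], theta, beta=[0.05, 0.05], rng=rng)
```

Because a "miss" redraws from the population frequencies, the probabilities returned by `hit_miss_prob` sum to one over the categories of a feature for any true value.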
Remark: A convenient property of the hit-miss model is that one can integrate out the unknown population values to directly obtain the distribution . The resulting marginal distribution is the product of within-cluster distributions. To improve mixing, we use a Metropolis within Gibbs algorithm to simulate from the joint posterior (See Appendix 0.A).
2.3 The prior distribution for
In this section, we propose a more intuitive and subjective construction of a prior distribution on Let denote the random partition of the observed records determined by and let denote the set containing all the possible partitions of the observed records. The distribution on the sample labels induces a distribution on . Furthermore, matches and duplicates are completely specified given the knowledge of the random partition , which is invariant with respect to the labelings of the partition blocks. Given this construction, one can directly focus on the partition distribution of the observed records without linking the labels distribution to a sample design and to a population size , see for example, . One can effectively consider the distribution of as a prior distribution for the latent linkage structure and concentrate only on its probabilistic properties. Both the interpretations of the role of (either as a consequence of the sampling design or a model represented by partitions) may provide useful insights for a correct choice of its prior distribution. One difficult and related question in the record linkage literature has been the subjective specification on the space of partitions. A simple, alternative prior for the number of distinct entities can be obtained looking at the following allocation rule for the record labels which is based on a generalization of the Chinese Restaurant Process, namely the Pitman-Yor process (PYP) (, ). (See Appendix 0.B for details).
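As a rough illustration of the sequential allocation rule behind a Pitman-Yor prior on partitions, the following Python sketch draws one random partition of a set of records. It assumes the standard two-parameter rule with a concentration parameter `theta` and a discount parameter `sigma` in [0, 1); these names, and the function `sample_pyp_partition`, are introduced here for illustration and are not taken from the paper.

```python
import random

def sample_pyp_partition(n_records, theta, sigma, rng):
    """Draw cluster labels for n_records via a Pitman-Yor sequential rule.

    Assumes 0 <= sigma < 1 and theta > -sigma (an assumption of this sketch).
    Returns (labels, sizes): labels[i] is the cluster of record i,
    sizes[j] is the number of records in cluster j.
    """
    labels, sizes = [], []
    for n in range(n_records):  # n = number of records already allocated
        k = len(sizes)          # number of clusters created so far
        if n == 0:
            new_cluster = True  # the first record always opens a cluster
        else:
            new_cluster = rng.random() < (theta + sigma * k) / (theta + n)
        if new_cluster:
            sizes.append(1)
            labels.append(k)
        else:
            # Join existing cluster j with probability proportional to size_j - sigma.
            weights = [s - sigma for s in sizes]
            j = rng.choices(range(k), weights=weights, k=1)[0]
            sizes[j] += 1
            labels.append(j)
    return labels, sizes

labels, sizes = sample_pyp_partition(500, theta=2.0, sigma=0.975, rng=random.Random(1))
```

Only the cluster composition matters in such a draw, which mirrors the point made above that the partition is invariant to the labelling of its blocks.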
3 The downstream task of linear regression
In this section, we propose record linkage methodology for the downstream task of linear regression. Consider the model for the population units, where the goal is to estimate the regression coefficients . We observe and where represents a noisy measurement of the true covariates and is a random copy of the corresponding population variable .
To better illustrate our approach, we consider two scenarios. In the first scenario, the complete regression scenario, each database reports a set of overlapping features, the response variable, and the covariates. Let and denote the observed values for the -th unit of the -th database, where and . In addition, let denote the entire set of regression data observed across the databases. In the complete scenario, there is not a bias problem concerning the estimation of the coefficients. In the second scenario, the broken regression scenario, we assume that the overlapping features are observed in each database, the response variable is observed in only the first database, and specific subsets of covariates are observed in the other databases. In this situation, let denote the observation , where and , where and . Note that represents only a fixed subset of the values for . Here, there is a bias issue regarding estimating the coefficients. (Footnote 2: In both scenarios, we assume that the covariates have zero mean and the regression model does not have an intercept.)
3.1 Simple linear regression
In this section, we consider linear regression and the two scenarios mentioned above with a single covariate First, consider the complete regression scenario. Let be the true value of observation corresponding to the records of cluster . Now consider a cluster with one record. Given the true value of and membership to cluster , we assume that the response variable follows a standard normal regression model with covariate , where the observed value for the covariate is normal with mean and and are independent. That is,
We assume that , which allows one to integrate via equation 4. In fact, setting , one can easily show that conditionally on the event , it follows that
For ease of notation, let denote the identity matrix, denote the -vector of zero; denote a vector of all ’s, and . Next, set
Consider a cluster with two records. The two pairs and are random vectors, both depending on the same “true” value . Let be the Kronecker product. Conditionally on and on the cluster membership, we replicate the model for a cluster with one record by assuming that and
are two independent bivariate normal random variables with joint distribution
Then the marginal distribution of is
This argument can be extended to any cluster size. When card, the marginal distribution of is again multivariate normal:
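To make the structure of this marginal distribution concrete, here is a small Python sketch that builds the covariance matrix of the stacked pairs (response, observed covariate) for a cluster of any size. It assumes the model described above with a zero-mean normal latent covariate; the symbol names `beta`, `var_x` (variance of the latent covariate), `var_y` (regression error variance) and `var_z` (measurement error variance) are assumptions of this sketch, since the paper's own notation is not recoverable here.

```python
def cluster_covariance(k, beta, var_x, var_y, var_z):
    """Covariance of (y_1, z_1, ..., y_k, z_k) for k records sharing one latent covariate.

    Each record is y_i = beta * x + e_i and z_i = x + u_i with x ~ N(0, var_x),
    e_i ~ N(0, var_y), u_i ~ N(0, var_z), all independent (assumed model).
    The result is a block-diagonal noise term plus a rank-one term from the shared x.
    """
    loadings = [beta, 1.0] * k      # how each coordinate depends on the shared x
    noise = [var_y, var_z] * k      # independent noise variance of each coordinate
    dim = 2 * k
    cov = [[var_x * loadings[a] * loadings[b] for b in range(dim)] for a in range(dim)]
    for a in range(dim):
        cov[a][a] += noise[a]
    return cov

# Hypothetical values: cluster with two records.
cov2 = cluster_covariance(2, beta=2.0, var_x=1.0, var_y=0.25, var_z=0.04)
```

Records in the same cluster are correlated only through the shared latent covariate, so every off-diagonal block equals the same rank-one matrix.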
Next, consider the broken regression scenario. In this case, when some information is missing (either the covariate in the first database or the response variable in some of the other databases) one can easily marginalize over the missing variables by using standard properties of the multivariate normal distribution. Let denote the set of regression observations, which conditionally on , correspond to the -th population unit. For example, for a cluster with one record in the first database, we denote this as . Using the marginal density of in equation 5, we can write the likelihood, conditional on , as Similarly, suppose with , then and the likelihood is given by the marginal density of . Next, consider a cluster with a record in the first database and the other record in a different database, i.e. It follows that and the corresponding likelihood is found by marginalizing over the missing values in equation 6, where we obtain the joint density in equation 5. Finally, it follows that the likelihood function (as a function of ) for both the complete and broken regression scenarios can be generally written as (Footnote 3: We assume that population units that do not have an observed cluster size contribute to the likelihood with a factor equal to 1.)
In order to handle the record linkage and downstream regression task simultaneously, we assume conditional independence on between the overlapping features in the record linkage model and the set of variables in the downstream task of linear regression. Assuming conditional independence, we find
The first factor is related to the record linkage process, and second factor is related to the downstream task of linear regression, and the other factors represent the prior distributions. We assume independent diffuse priors for . To update the appropriate regression parameters , we use the Metropolis-Hastings algorithm in Appendix 0.A. Using the factorization of the posterior in equation (7), the proposed method can be generalized to any statistical model.
3.2 Multiple linear regression
We extend the downstream task to that of multiple regression, first considering the complete regression scenario. Let denote a cluster of size , denote a vector with observations of the response variable in this cluster, and denote the matrix with the values of the covariates observed in the cluster units. Let denote the vector of elements with the rows of the matrix vertically stacked and let denote the vector containing the true values of the covariates. Equation 4 can be generalized assuming that
This way the marginal distribution of is -variate normal with zero mean and covariance matrix
which simplifies into
The likelihood provided by the multiple regression model is the product of the factors for the observed clusters. The same considerations from linear regression regarding modeling the prior and the computational aspects apply to multiple linear regression. Note the major difference is in the marginalization pattern in the broken regression scenario. In fact, for a cluster joining records across more than one database, we may need to integrate out the covariate values missing in the databases that share a cluster.
To investigate the performance of our proposed methodology we consider the RLdata500 data set from the RecordLinkage package in R. This synthetic data set consists of 500 records, each comprising first and last name and full date of birth. We modify this data set to consider two databases, where each database contains 250 records, with duplicates within and across the two databases. To consider the case without duplicate detection, we modify the original RLdata500 such that it has no duplicate records within each of the two databases. The case without duplicate detection is a special case of our general methodology (see Appendix 0.C). We provide experiments for both record linkage and the downstream task.
4.1 Record linkage with and without duplicate-detection
We provide two record linkage experiments: one with duplicate detection and one without duplicate detection. In Figures 1 and 2, we report the prior and the posterior for and the performance of the record linkage procedure measured in terms of the posteriors of the false negative rates (FNR) and the false discovery rates (FDR). (For a review of FNR and FDR, see [15, 1]).
Figure 1 (with duplicate detection) illustrates that the resulting posteriors of appear robust to the choices of and (first row). We observe similar behavior for the posteriors of FNR and FDR (second and third rows). Figure 2 illustrates that as we vary the PYP parameters, the posterior of is weakly dependent on their values. The two database framework without duplicate detection leads, a posteriori, to similar FNR (second row) and lower FDR (third row) compared to the previous case. (See Appendix 0.D for the PYP parameter settings).
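For readers unfamiliar with these metrics, the following Python sketch computes pairwise FNR and FDR for one estimated linkage structure against a known truth. It assumes the common pairwise definitions (FNR as the share of true links that are missed, FDR as the share of estimated links that are false); the function names and the toy labels are illustrative and are not taken from the paper. Applying it to each posterior draw of the linkage structure would give posterior distributions of the two rates.

```python
from itertools import combinations

def linked_pairs(labels):
    """Set of record pairs (i, j), i < j, that share the same cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def fnr_fdr(true_labels, est_labels):
    """Pairwise false negative rate and false discovery rate."""
    true_links = linked_pairs(true_labels)
    est_links = linked_pairs(est_labels)
    correct = len(true_links & est_links)
    # Guard against empty link sets to avoid division by zero.
    fnr = 1.0 - correct / len(true_links) if true_links else 0.0
    fdr = 1.0 - correct / len(est_links) if est_links else 0.0
    return fnr, fdr

# Toy example: five records, two true links, one of which is recovered.
fnr, fdr = fnr_fdr([0, 0, 1, 2, 2], [0, 0, 1, 1, 2])
```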
4.2 Regression experiments
We consider three regression experiments on the RLdata500 data set. In Experiment I, we consider the complete regression scenario in a single database framework with duplicate detection. In Experiment II, we consider the broken regression scenario with record linkage and duplicate detection. In Experiment III, we consider the broken multiple regression scenario in a two database framework without duplicate detection. (See Appendix 0.E for details).
Figure 3 gives the results of Experiment I. The posteriors of , , from our joint modeling approach (first row, solid lines) do not show remarkable differences when compared to their true counterpart (first row, dotted lines), which were obtained by fitting the regression model conditional on the true value of The similarity between the posteriors is mainly due to the large concentration of around the true pattern of duplications. The mode of the posterior of the number of distinct entities is exactly the true value (450), where the FNR and FDR are considerably smaller with respect to the case without the and columns. Hence, the effect of considering the information provided by the regression model has improved the record linkage process.
Figure 4 gives the results of Experiment II. The posteriors (first row, solid lines) of are similar to the corresponding true posteriors (first row, dashed lines). We report the posteriors obtained by fixing equal to the point estimate provided by the hit-and-miss model applied to the categorical variables alone (first row, dotted lines). The posteriors of and obtained with the plug-in approach are strongly biased by the presence of false matches, which, on the other hand, do not affect the posterior of . This distribution depends on the 13 duplicated entities with two copies of which are correctly accounted for in the plug-in approach. To better illustrate the causes of the distortion in the estimation of the regression parameters, the right panel on the top row shows all the pairs resulting from the plug-in approach. The solid black circles represent the true matches, and the empty red circles represent the false matches, with independent and values. We report the corresponding regression lines, where the three false matches are lowering the estimate and increasing the estimate. Further analysis reveals that the posterior for (second row) with the integrated hit-miss and regression model is less concentrated with respect to the first experiment but it is more concentrated with respect to the single hit-miss model. We reduce the FDR, leaving the FNR almost unchanged. We call this the feedback effect of the regression from the downstream task. For example, if we consider a false link, the posterior probability of being a match will typically be down-weighted by the low likelihood arising from the regression part of the model. Hence, in addition to centering the estimates of the regression coefficient, the joint regression and hit-miss model improves record linkage performance.
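The down-weighting of false links can be illustrated with a small Python sketch for the broken regression scenario with one covariate. It compares the regression likelihood of a candidate pair (a response from the first database, a noisy covariate from another database) when the two records are linked against the likelihood when they belong to different entities. The model and the names `beta`, `var_x`, `var_y`, `var_z` and `regression_log_ratio` are assumptions of this sketch (zero-mean normal latent covariate, no intercept), not the paper's notation.

```python
import math

def log_normal_pdf(v, var):
    """Log density of N(0, var) at v."""
    return -0.5 * (math.log(2 * math.pi * var) + v * v / var)

def log_bivariate_pdf(y, z, var_y, var_z, cov):
    """Log density of a zero-mean bivariate normal at (y, z)."""
    det = var_y * var_z - cov * cov
    quad = (var_z * y * y - 2 * cov * y * z + var_y * z * z) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def regression_log_ratio(y, z, beta, var_x, var_y, var_z):
    """Log likelihood ratio: (y, z) share a latent covariate vs. independent records."""
    marg_var_y = beta ** 2 * var_x + var_y   # marginal variance of a response
    marg_var_z = var_x + var_z               # marginal variance of a noisy covariate
    linked = log_bivariate_pdf(y, z, marg_var_y, marg_var_z, beta * var_x)
    unlinked = log_normal_pdf(y, marg_var_y) + log_normal_pdf(z, marg_var_z)
    return linked - unlinked

# Hypothetical values: a pair consistent with slope 2, and an inconsistent pair.
consistent = regression_log_ratio(2.0, 1.0, beta=2.0, var_x=1.0, var_y=0.25, var_z=0.04)
inconsistent = regression_log_ratio(-2.0, 1.0, beta=2.0, var_x=1.0, var_y=0.25, var_z=0.04)
```

A positive ratio supports the link and a negative ratio penalizes it, so in a joint model this factor multiplies the evidence from the categorical features when a cluster assignment is updated.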
gives the results of Experiment III. The joint model gives posteriors similar to the true ones while the plug-in approach gives biased estimates and larger variability (first row, left upper panels). The presence of false matches in the plug-in approach gives a positive bias in estimating the variance and affects the posterior of the measurement error parameters (first row, right upper panels). The posteriors of and (not reported) both with the joint model and the true are essentially equal to the prior, while the plug-in posterior is concentrated on larger values. Under such conditions, even with the true linkage structure, we do not have any useful information for estimating the measurement error variances due to the lack of duplicated values. Thus, while the joint model correctly does not contradict the information provided by the prior, the presence of false matches creates pairs that could also be explained by a larger measurement error of the covariates. We observe that the joint modeling of the record linkage and regression data improves the matching process, as noted by the higher concentration of (second row, left lower panel) around the true value of 450 and the lower FNRs and FDRs (second row, right lower panels) with respect to results obtained with the hit-and-miss model only.
We have made three major contributions in this paper. First, we have proposed a Bayesian record linkage model investigating the role that prior partition models may have on the matching process. Second, we have proposed a generalized framework for record linkage and regression that accounts for the record linkage error exactly. Using our methodology, one is able to generate a feedback mechanism of the information provided by the working statistical model on the record linkage process. This feedback mechanism is essential to eliminate potential biases that can jeopardize the resulting post-linkage inference. Third, we illustrate our record linkage and multiple regression methodology on many experiments involving a synthetic data set, where improvements are gained in terms of standard record linkage evaluation metrics.
Steorts was supported by NSF-1652431 and NSF-1534412. Tancredi and Liseo were supported by Ministero dell’ Istruzione dell’ Universita e della Ricerca, Italia PRIN 2015.
- Christen  Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
- Copas and Hilton  Copas, J. and Hilton, F. (1990). Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society, A, 153 287–320.
- De Blasi et al.  De Blasi, P., Favaro, S., Lijoi, A., Mena, R., Prünster, I. and Ruggiero, M. (2015). Are Gibbs-Type Priors the Most Natural Generalization of the Dirichlet Process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37,2 803–821.
- Goldstein et al.  Goldstein, H., Harron, K. and Wade, A. (2012). The analysis of record-linked data using multiple imputation with data value priors. Statistics in Medicine, 31 3481–3493.
- Gutman et al.  Gutman, R., Afendulis, C. C. and Zaslavsky, A. M. (2013). A Bayesian procedure for file linking to analyze end-of-life medical costs. Journal of the American Statistical Association, 108 34–47.
- Gutman et al.  Gutman, R., Sammartino, C., Green, T. and Montague, B. (2016). Error adjustments for file linking methods using encrypted unique client identifier (euci) with application to recently released prisoners who are HIV+. Statistics in Medicine, 35 115–129.
- Hof and Zwinderman  Hof, M. and Zwinderman, A. (2012). Methods for analyzing data from probabilistic linkage strategies based on partially identifying variables. Statistics in Medicine, 31 4231–4242.
- Kim and Chambers  Kim, G. and Chambers, R. (2012). Regression analysis under incomplete linkage. Computational Statistics & Data Analysis, 56 2756–2770.
- Lahiri and Larsen  Lahiri, P. and Larsen, M. D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100 222–230.
- Liseo and Tancredi  Liseo, B. and Tancredi, A. (2011). Bayesian estimation of population size via linkage of multivariate normal data sets. Journal of Official Statistics, 27 491–505.
- MacEachern  MacEachern, S. N. (1994). Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics-Simulation and Computation, 23 727–741.
- Neal  Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9 249–265.
- Pitman  Pitman, J. (2006). Combinatorial Stochastic Processes. Ecole d’Eté de Probabilités de Saint-Flour XXXII, Lecture Notes in Mathematics, vol. 1875, Berlin, Springer.
- Sadinle  Sadinle, M. (2014). Detecting duplicates in a homicide registry using a Bayesian partitioning approach. The Annals of Applied Statistics, 8 2404–2434.
- Steorts  Steorts, R. C. (2015). Entity resolution with empirically motivated priors. Bayesian Analysis, 10 849–875.
- Steorts et al.  Steorts, R. C., Hall, R. and Fienberg, S. E. (2014). SMERED: A Bayesian approach to graphical record linkage and de-duplication. Journal of Machine Learning Research, 33 922–930.
- Steorts et al.  Steorts, R. C., Hall, R. and Fienberg, S. E. (2016). A Bayesian approach to graphical record linkage and de-duplication. Journal of the American Statistical Association.
- Tancredi and Liseo  Tancredi, A. and Liseo, B. (2011). A hierarchical Bayesian approach to record linkage and population size problems. Annals of Applied Statistics, 5 1553–1585.
- Yamato and Shibuya  Yamato, H. and Shibuya, M. (2000). Moments of some statistics of pitman sampling formula. Bulletin of informatics and cybernetics, 32 1–10.
Appendix 0.A Metropolis Algorithm
We provide our Metropolis-within-Gibbs algorithm that allows direct simulation from the joint posterior. Let be the vector with the element removed and let be the set of all the observed records with record removed, which conditionally on , refer to the population individual The full conditional distribution of is
where In equation 8, set , which implies that It follows that When the population entity represents an already existing cluster given , the above ratio can also be written as
When the label identifies a new cluster, the following simplification is possible:
Note that the posterior is invariant with respect to the cluster labels and that we are only interested in the cluster composition. Thus, we can avoid simulating the entire population label distribution, and instead set (since there can be at most clusters) and update with the following:
for , where is the number of clusters without the label . This way of updating the cluster assignment is standard when the CRP is used for a prior on the cluster assignments. In addition, the marginal likelihood of the cluster observations is known or can be easily calculated using a recursive formula, see for example  and .
To adapt the algorithm to the two different prior distributions of , note that, when labels an observed cluster, the use of a uniform prior on implies that
With the PYP prior, the above mentioned probabilities are, respectively,
where here denotes the size of the cluster without the entity . Finally, when a uniform prior on the partition space is considered, one has
Finally, full conditional distributions of the components of have a computationally manageable form using a recursive formula. In fact, assuming a standard Beta prior on each , one obtains
and a straightforward Metropolis step can be easily implemented.
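The cluster-assignment update described in this appendix follows the familiar pattern of multiplying a prior allocation weight by a likelihood term for each candidate cluster. The Python sketch below shows that pattern with Pitman-Yor prior weights. It is only a schematic: the likelihood terms are passed in as precomputed log values, and the names `assignment_probabilities`, `theta` and `sigma` are assumptions introduced here rather than the paper's notation.

```python
import math

def assignment_probabilities(sizes, log_lik_existing, log_lik_new, theta, sigma):
    """Normalized probabilities for reassigning one record.

    sizes[j]            size of existing cluster j with the record removed
    log_lik_existing[j] log likelihood of the record if it joins cluster j
    log_lik_new         log marginal likelihood of the record as a new cluster
    Returns one probability per existing cluster, followed by the
    probability of opening a new cluster.
    """
    k = len(sizes)
    # Prior weight (size_j - sigma) for existing clusters, (theta + sigma * k)
    # for a new one; the common denominator cancels when normalizing.
    log_w = [math.log(s - sigma) + ll for s, ll in zip(sizes, log_lik_existing)]
    log_w.append(math.log(theta + sigma * k) + log_lik_new)
    top = max(log_w)                       # subtract the maximum for stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical example with equal likelihood terms: only the prior matters.
probs = assignment_probabilities([3, 1], [0.0, 0.0], 0.0, theta=1.0, sigma=0.5)
```

In a joint record linkage and regression model, the log-likelihood inputs would be the sum of the hit-miss term and the regression term for each candidate cluster.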
Appendix 0.B Construction of PYP Priors
We now briefly describe adapting the PYP prior to our database framework. Assume the first records of the -th database and all the records of the first databases are classified into clusters identified by the population labels with sizes respectively. Also, let denote the total number of these records. Next, the label of the record identifies a new cluster with probability
where are two parameters whose admissible values are with or with for some positive integer . Moreover, will assume an already observed label identifying a cluster with size with probability
The above updating rule induces a prior on the set of the possible partitions of all the records which can be written as 
where are the cluster sizes of the partition and It can also be proved  that, under this prior, the expected value of is
and the variance is
with . For more details, we refer to .
The above equations can be used for prior elicitation by fixing and in order to have equal to a rough prior guess for the number of clusters and a specific amount of prior variability for . Moreover, in evaluating the asymptotic properties,  observes that as , becomes infinite for non negative values of ; on the other hand, if is negative, is equal almost surely to which thus takes the role of the size in a finite population framework.
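For prior elicitation, the expected number of clusters can also be computed numerically from the sequential rule itself. The Python sketch below uses the recursion E[K_{m+1}] = E[K_m] + (theta + sigma E[K_m]) / (theta + m), which follows from the new-cluster probability by linearity of expectation. The parameter names `theta` (concentration) and `sigma` (discount), and the function name, are assumptions of this sketch.

```python
def pyp_expected_clusters(n, theta, sigma):
    """Prior expected number of clusters among n records under a Pitman-Yor rule.

    Uses the fact that record m + 1 opens a new cluster with probability
    (theta + sigma * K_m) / (theta + m), which is linear in K_m.
    """
    expected = 1.0                      # the first record always forms a cluster
    for m in range(1, n):
        expected += (theta + sigma * expected) / (theta + m)
    return expected

# Settings reported in Appendix 0.D, read here as (theta, sigma) pairs.
settings = [(0.4, 0.98), (2.0, 0.975), (10.0, 0.965)]
prior_means = [pyp_expected_clusters(500, t, s) for t, s in settings]
```

With 500 records, these three settings give prior means close to 450, in line with the values reported in Appendix 0.D; the reading of each pair as (concentration, discount) is an assumption made here.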
Appendix 0.C Record linkage without duplicate-detection
We now consider record linkage of two databases without duplicate-detection. To consider this case, we simply modify the prior distribution on the ’s such that and for . In this case, clusters consist of at most two elements so that the distribution of the observed records , conditional on and , can be calculated analytically without exploiting the recursive formula.
If a uniform prior on the label space is assumed, the above conditioning is equivalent to assuming that the two databases are two simple random samples without replacement from a population of units. This is the same situation described in , where is assumed unknown. Assume that denotes the number of common units between the two databases; then is equal to , where
follows a hypergeometric distribution
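As a numerical companion to the hypergeometric statement, the following Python sketch evaluates the distribution of the number of units shared by two databases, under the assumption that each database contains distinct units drawn at random from a population of `pop_size` units. The function and argument names are illustrative and not taken from the paper.

```python
from math import comb

def overlap_pmf(t, pop_size, n1, n2):
    """P(T = t): probability that two databases of sizes n1 and n2 share t units."""
    if t < max(0, n1 + n2 - pop_size) or t > min(n1, n2):
        return 0.0                      # outside the support
    return comb(n1, t) * comb(pop_size - n1, n2 - t) / comb(pop_size, n2)

def distinct_entities(t, n1, n2):
    """Number of distinct entities observed when the databases share t units."""
    return n1 + n2 - t

# Hypothetical example: population of 20 units, databases of sizes 5 and 6.
pmf = [overlap_pmf(t, 20, 5, 6) for t in range(6)]
mean_overlap = sum(t * p for t, p in enumerate(pmf))
```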
From a computational perspective, the conditioning of the uniform prior does not imply substantial changes. In fact if a PYP prior is assumed, the standard record linkage framework can be tackled by imposing that for and that the units of the second database may only join a cluster composed by a single unit of the first database or create a new cluster, that is
where and is the number of distinct elements considering the first database and the first elements of the second database. Finally, notice that
This implies that the ’s are no longer exchangeable. This problem, although interesting from a theoretical perspective, does not cause computational issues.
The conditional prior probabilities for the Gibbs step updating of to be used from equation (10) are
Appendix 0.D Record Linkage Experiment
We provide the parameter settings for the record linkage experiments. For the case with duplicate detection, we considered the effect of the PYP prior for with =(0.4,0.98), (2,0.975), (10,0.965). These prior distributions have a common prior mean of almost equal to 450; however, their respective variances are quite different. For the case of no duplicate detection, we consider the effect of the constrained PYP prior for with and These values of the hyper-parameters produce prior means for the number of matches equal to 75, 50 and 25.
Appendix 0.E Regression Experiments
We elaborate on our three regression experiments. In the first experiment, we modify the data set by adding two columns with the pairs and