Bayesian Estimation of Bipartite Matchings for Record Linkage

01/25/2016 ∙ by Mauricio Sadinle, et al. ∙ Duke University

The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. This is non-trivial in the absence of unique identifiers and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal paper by Fellegi and Sunter (1969). These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods merging two datafiles on casualties from the civil war of El Salvador.

1 Introduction

Joining data sources requires identifying which entities are simultaneously represented in more than one source. Although this is a trivial process when unique identifiers of the entities are recorded in the datafiles, in general it has to be solved using the information that the sources have in common on the entities. Most of the statistical techniques currently used to solve this task are derived from a seminal paper by FellegiSunter69, who formalized procedures that had been proposed earlier (see Newcombeetal59; NewcombeKennedy62, and references therein). A number of important record linkage projects have been developed under some variation of the Fellegi-Sunter approach, including the merging of the 1990 U.S. Decennial Census and Post-Enumeration Survey to produce adjusted Census counts (WinklerThibaudeau91), the Generalized Record Linkage System at Statistics Canada (Fair04), the Person Identification Validation System at the U.S. Census Bureau (WagnerLayne14), and the LinkPlus software at the U.S. Centers for Disease Control and Prevention (CDCLinkPlus), among many others (e.g., GillGoldacre03; Singleton13).

In this article we are concerned with bipartite record linkage, where we seek to merge two datafiles while assuming that each entity is recorded at most once in each file. Most of the statistical literature on record linkage deals with this scenario (FellegiSunter69; Jaro89; Winkler88; Winkler93; Winkler94; BelinRubin95; LarsenRubin01; HerzogScheurenWinkler07; TancrediLiseo11; Gutmanetal13). Despite the popularity of the Fellegi-Sunter approach and its variants to solve this task, it is also recognized to have a number of caveats (e.g., Winkler02). In particular, the no-duplicates within-file assumption implies a maximum one-to-one restriction in the linkage, that is, a record from one file can be linked with at most one record from the other file. Modern implementations of the Fellegi-Sunter methodology that use mixture models ignore this restriction (Winkler88; Jaro89; BelinRubin95; LarsenRubin01), leading to the necessity of enforcing the maximum one-to-one assignment in a post-processing step (Jaro89). Furthermore, this restriction is also ignored by the decision rule proposed by FellegiSunter69 to classify pairs of records into links, non-links, and possible links, and therefore the conditions for its theoretical optimality are not met in practice.

Despite the weaknesses of the Fellegi-Sunter approach, it has a number of advantages on which we build in this article, in addition to pushing forward existing Bayesian improvements. After clearly defining a bipartite matching as the parameter of interest in bipartite record linkage (Section 2), in Section 3 we review the traditional Fellegi-Sunter methodology, its variants and modern implementations using mixture models, and we provide further details on its caveats. In Section 4 we improve on existing Bayesian record linkage ideas, in particular we extend the modeling approaches of Fortinietal01 and Larsen02; Larsen05; Larsen10; Larsen12 to properly deal with missing values and capture partial agreements when comparing pairs of records. Most importantly, in Section 5 we derive Bayes estimates of the bipartite matching according to a general class of loss functions. Given that Bayesian approaches allow us to properly quantify uncertainty in the matching decisions we include a rejection option in our loss functions with the goal of leaving uncertain parts of the bipartite matching undeclared. The resulting Bayes estimates provide an alternative to the Fellegi-Sunter decision rule. In Section 6 we compare our Bayesian approach with the traditional Fellegi-Sunter methodology under a variety of linkage scenarios. In Section 7 we consider the problem of joining two data sources on civilian casualties from the civil war of El Salvador, and we explain the advantages of using our estimation procedures in that context.

2 The Bipartite Record Linkage Task

Consider two datafiles $X_1$ and $X_2$ that record information from two overlapping sets of individuals or entities. These datafiles contain $n_1$ and $n_2$ records, respectively, and without loss of generality we assume $n_1 \geq n_2$. These files originate from two record-generating processes that may induce errors and missing values. We assume that each individual or entity is recorded at most once in each datafile, that is, the datafiles contain no duplicates. Under this set-up the goal of record linkage can be thought of as identifying which records in files $X_1$ and $X_2$ refer to the same entities. We denote the number of entities simultaneously recorded in both files by $n_{12}$, and so $0 \leq n_{12} \leq n_2$. Formally, our parameter of interest can be represented by a bipartite matching between the two sets of records coming from the two files, as we now explain.

2.1 A Bipartite Matching as the Parameter of Interest

We briefly review some basic terminology from graph theory (see, e.g., LovaszPlummer86). A graph $G = (V, E)$ consists of a finite number of elements $V$ called nodes and a set $E$ of pairs of nodes called edges. A graph whose node set $V$ can be partitioned into two disjoint non-empty subsets $A$ and $B$ is called bipartite if each of its edges connects a node of $A$ with a node of $B$. A set of edges in a graph $G$ is called a matching if all of them are pairwise disjoint. A matching in a bipartite graph is naturally called a bipartite matching (see the example in Figure 1).

Figure 1: Example of bipartite matching represented by the edges in this graph (two node sets, with nodes labeled 1 to 5 and 1 to 4).

In the bipartite record linkage context we can think of the records from files $X_1$ and $X_2$ as two disjoint sets of nodes, where an edge between two records represents them referring to the same entity, which we also call being coreferent or being a match. The assumption of no duplicates within datafile implies that edges between records of the same file are not possible. Furthermore, given that the relation of coreference between records is transitive, the graph has to represent a bipartite matching, because if two edges had an overlap, say $(i, j)$ and $(i, j')$, $j \neq j'$, by transitivity we would have that $j$ and $j'$ would be coreferent, which contradicts the assumption of no within-file duplicates.

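A bipartite matching can be represented in different ways. The matrix representation consists of creating a matching matrix $\Delta$ of size $n_1 \times n_2$ whose $(i, j)$th entry is defined as

$$\Delta_{ij} = \begin{cases} 1, & \text{if records } i \in X_1 \text{ and } j \in X_2 \text{ refer to the same entity;} \\ 0, & \text{otherwise.} \end{cases}$$

The characteristics of a bipartite matching imply that each column and each row of $\Delta$ contains at most one entry equal to one. This representation has been used by a number of authors (LiseoTancredi11; TancrediLiseo11; Fortinietal01; Fortinietal02; Larsen02; Larsen05; Larsen10; Larsen12; Gutmanetal13) but it is not very compact. We propose an alternative way of representing a bipartite matching by introducing a matching labeling $Z = (Z_1, \dots, Z_{n_2})$ for the records in the file $X_2$ such that

$$Z_j = \begin{cases} i, & \text{if records } i \in X_1 \text{ and } j \in X_2 \text{ refer to the same entity;} \\ n_1 + j, & \text{if record } j \in X_2 \text{ does not have a match in file } X_1. \end{cases}$$

Naturally we can go from one representation to the other using the relationship $\Delta_{ij} = I(Z_j = i)$, where $I(\cdot)$ is the indicator function. We shall use either representation throughout the document depending on which one is more convenient, although matching labelings are better suited for computations.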
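The two representations can be converted into one another mechanically. The following is a minimal sketch, assuming 1-based labels stored in a Python list (record j of the second file without a match carries the label n1 + j); the function names `labeling_to_matrix` and `matrix_to_labeling` are hypothetical and not part of the original article.

```python
def labeling_to_matrix(z, n1):
    """Build the n1 x n2 matching matrix from a matching labeling.

    z[j-1] = i with 1 <= i <= n1 means record j of file 2 matches record i
    of file 1; z[j-1] = n1 + j means record j has no match in file 1.
    """
    n2 = len(z)
    delta = [[0] * n2 for _ in range(n1)]
    for j, label in enumerate(z, start=1):
        if 1 <= label <= n1:
            if any(delta[label - 1]):
                raise ValueError(f"record {label} of file 1 is matched twice")
            delta[label - 1][j - 1] = 1
        elif label != n1 + j:
            raise ValueError(f"invalid label {label} for record {j}")
    return delta


def matrix_to_labeling(delta):
    """Recover the matching labeling from a matching matrix."""
    n1, n2 = len(delta), len(delta[0])
    if any(sum(row) > 1 for row in delta):
        raise ValueError("a row has more than one entry equal to one")
    z = []
    for j in range(1, n2 + 1):
        rows = [i for i in range(1, n1 + 1) if delta[i - 1][j - 1] == 1]
        if len(rows) > 1:
            raise ValueError("a column has more than one entry equal to one")
        z.append(rows[0] if rows else n1 + j)
    return z


# Five records in file 1, four in file 2; records 2 and 4 of file 2 are unmatched.
z = [3, 7, 1, 9]
delta = labeling_to_matrix(z, n1=5)
```

The labeling stores n2 integers instead of n1 × n2 binary entries, which is the compactness argument made above.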

2.2 Approaches to Bipartite Record Linkage

The goal of bipartite record linkage is to estimate the bipartite matching between two datafiles using the information contained in them. There are a number of different approaches to do this depending on the specific characteristics of the problem and what information is available.

A number of approaches directly model the information contained in the datafiles (Fortinietal02; Matsakis10; LiseoTancredi11; TancrediLiseo11; Gutmanetal13; Steortsetal13). This requires crafting specific models for each type of field in the datafile, and these approaches are therefore currently limited to handling nominal categorical fields, or continuous variables modeled under normality. In practice, however, fields that are complicated to model, such as names, addresses, phone numbers, or dates, are important to merge datafiles.

A more common way of tackling this problem is to see it as a traditional classification problem: we need to classify record pairs into matches and non-matches. If we have access to a sample of record pairs for which the true matching statuses are known, we can train a classifier on this sample using comparisons between the pairs of records as our predictors, and then predict the matching statuses of the remaining record pairs (e.g., Cochinwalaetal01; Bilenkoetal03; Christen08; Sametal13). Nevertheless, classification methods typically assume that we are dealing with i.i.d. data, and therefore the training of the models and the prediction using them heavily rely on this assumption. In fact, given that these methods output independent matching decisions for pairs of records, they lead to conflicting decisions since they violate the maximum one-to-one assignment constraint of bipartite record linkage. Typically some subsequent post-processing step is required to solve these inconsistencies.

Finally, perhaps the most popular approach to record linkage is what we shall call the Fellegi-Sunter approach, although many authors have contributed to it over the years. Despite its difficulties, this approach does not require training data and it can handle any type of field, as long as records can be compared in a meaningful way. Given that training samples are too expensive to create and datafiles often contain information that is too complicated to model, we believe that the Fellegi-Sunter approach tackles the most common scenarios where record linkage is needed. We therefore review this approach in more detail in the next section, and in the remainder of the article we shall refrain from referring to the direct modeling and supervised classification approaches.

3 The Fellegi-Sunter Approach to Record Linkage

Following FellegiSunter69, we can think of the set of ordered record pairs $X_1 \times X_2$ as the union of the set of matches $M = \{(i, j);\ i \in X_1, j \in X_2, \Delta_{ij} = 1\}$ and the set of non-matches $U = \{(i, j);\ i \in X_1, j \in X_2, \Delta_{ij} = 0\}$. The goal when linking two files can be seen as identifying the sets $M$ and $U$. When record pairs are estimated to be matches they are called links and when estimated to be non-matches they are called non-links. The Fellegi-Sunter approach uses pairwise comparisons of the records to estimate their matching statuses.

3.1 Comparison Data

In most record linkage applications two records that refer to the same entity should be very similar, otherwise the amount of error in the datafiles may be too large for the record linkage task to be feasible. On the other hand, two records that refer to different entities should generally be very different. Comparison vectors $\gamma_{ij}$ are obtained for each record pair $(i, j)$ in $X_1 \times X_2$ with the goal of finding evidence of whether they represent matches or not. These vectors can be written as $\gamma_{ij} = (\gamma_{ij}^1, \dots, \gamma_{ij}^F)$, where $F$ denotes the number of criteria used to compare the records. Traditionally these criteria correspond to one comparison for each field that the datafiles have in common.

The appropriate comparison criteria depend on the information contained by the records. The simplest way to compare the same field of two records is to check whether they agree or not. This strategy is commonly used to compare unstructured nominal information such as gender or race, but it ignores partial agreements when used with strings or numeric measurements. To take into account partial agreement among string fields (e.g., names) Winkler90Strings proposed to use string metrics, such as the normalized Levenshtein edit distance or any other (see Bilenkoetal03; Elmagarmidetal07), and divide the resulting set of similarity values into different levels of disagreement. Winkler’s approach can be extended to compute levels of disagreement for fields that are not appropriately compared in a dichotomous fashion.

Let $S_f(i, j)$ denote a similarity measure computed from field $f$ of records $i$ and $j$. The range of $S_f$ can be divided into $L_f + 1$ intervals $I_{f0}, I_{f1}, \dots, I_{fL_f}$ that represent different disagreement levels. In this construction the interval $I_{f0}$ represents the highest level of agreement, which includes total agreement, and the last interval $I_{fL_f}$ represents the highest level of disagreement, which depending on the field represents complete or strong disagreement. This allows us to construct the comparison vectors from the ordinal variables:

$$\gamma_{ij}^f = l, \quad \text{if } S_f(i, j) \in I_{fl}.$$

The larger the value of $\gamma_{ij}^f$, the more record $i$ and record $j$ disagree in field $f$.

Although in principle we could define $\gamma_{ij}$ using the original similarity values $S_f(i, j)$, in the Fellegi-Sunter approach these comparison vectors need to be modeled. Directly modeling the original $S_f(i, j)$ requires a customized model per type of comparison given that these similarity measures output values in different ranges depending on their functional form and the field being compared. By building disagreement levels as ordinal categorical variables, however, we can use a generic model for any type of comparison, as long as its values are categorized.

The selection of the thresholds that define the intervals should correspond with what are considered levels of disagreement, which depend on the specific application at hand and the type of field being compared. For example, in the simulations and applications presented here we build levels of disagreement according to what we consider to be no disagreement, mild disagreement, moderate disagreement, and extreme disagreement.
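As an illustration of how a string field can be turned into an ordinal disagreement level, the following sketch computes a normalized Levenshtein similarity and bins it with cut points. The cut points `(1.0, 0.75, 0.5)` and the function names are assumptions made for this example only; the article does not prescribe specific thresholds here. A missing value on either side yields a missing comparison (`None`).

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]; 1 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def disagreement_level(a, b, cuts=(1.0, 0.75, 0.5)):
    """Ordinal level: 0 = highest agreement, len(cuts) = highest disagreement.

    `cuts` must be decreasing; level l is the first l with similarity >= cuts[l].
    Returns None when either value is missing.
    """
    if a is None or b is None:
        return None
    s = similarity(a, b)
    for level, cut in enumerate(cuts):
        if s >= cut:
            return level
    return len(cuts)
```

With these hypothetical cut points, `disagreement_level("maria", "marta")` falls in the second interval (mild disagreement), while two strings with no characters in common fall in the last one.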

3.2 Blocking

In practice, when the datafiles are large the record linkage task becomes too computationally expensive. For example, the cost of computing the comparison data alone grows quadratically since there are $n_1 \times n_2$ record pairs. A common solution to this problem is to partition the datafiles into blocks of records determined by information that is thought to be accurately recorded in both datafiles, and then solve the task only within blocks. For example, in census studies datafiles are often partitioned according to ZIP Codes (postal codes) and then linkage is attempted only among records sharing the same ZIP Code, that is, pairs of records with different ZIP Codes are assumed to be non-matches (HerzogScheurenWinkler07). Blocking can be used with any record linkage approach and there are different variations (see Christen12, for an extensive survey). Our presentation in this paper assumes that no blocking is being used, but in practice if blocking is needed the methodologies can simply be applied independently to each block.
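A minimal sketch of blocking on a single field follows. It assumes records are dictionaries and that the blocking field (here a hypothetical `"zip"` key) is never missing; the helper name `candidate_pairs` is illustrative only.

```python
from collections import defaultdict


def candidate_pairs(file1, file2, key):
    """Return the (i, j) index pairs whose records share the same blocking value."""
    blocks = defaultdict(list)
    for i, record in enumerate(file1):
        blocks[record[key]].append(i)
    pairs = []
    for j, record in enumerate(file2):
        # Pairs with different blocking values are assumed to be non-matches.
        for i in blocks.get(record[key], []):
            pairs.append((i, j))
    return pairs


file1 = [{"zip": "10001"}, {"zip": "10002"}, {"zip": "10001"}]
file2 = [{"zip": "10001"}, {"zip": "10003"}]
pairs = candidate_pairs(file1, file2, "zip")
```

Only two of the six possible pairs survive in this toy example; the price is that any true match whose blocking field was recorded with error can no longer be found.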

3.3 The Fellegi-Sunter Decision Rule

The comparison vector $\gamma_{ij}$ alone is insufficient to determine whether $(i, j) \in M$, since the variables being compared usually contain random errors and missing values. FellegiSunter69 used the log-likelihood ratios

$$w_{ij} = \log \frac{P(\gamma_{ij} \mid \Delta_{ij} = 1)}{P(\gamma_{ij} \mid \Delta_{ij} = 0)} \tag{1}$$

as weights to estimate which record pairs are matches. Expression (1) assumes that $\gamma_{ij}$ is a realization of a random vector, say $\Gamma_{ij}$, whose distribution depends on the matching status $\Delta_{ij}$ of the record pair. Intuitively, if this ratio is large we favor the hypothesis of the pair being a match. Although this type of likelihood ratio was initially used by Newcombeetal59 and NewcombeKennedy62, the formal procedure proposed by FellegiSunter69 permits finding two thresholds such that the set of weights can be divided into three groups corresponding to links, non-links, and possible links. The procedure orders the possible values of $\gamma_{ij}$ by their weights in non-increasing order, indexing by the subscript $h$, and determines two values, $h'$ and $h''$, such that

$$\sum_{h \leq h' - 1} P(\Gamma_{ij} = \gamma_h \mid \Delta_{ij} = 0) < \mu \leq \sum_{h \leq h'} P(\Gamma_{ij} = \gamma_h \mid \Delta_{ij} = 0)$$

and

$$\sum_{h \geq h''} P(\Gamma_{ij} = \gamma_h \mid \Delta_{ij} = 1) \geq \lambda > \sum_{h \geq h'' + 1} P(\Gamma_{ij} = \gamma_h \mid \Delta_{ij} = 1),$$

where $\mu$ and $\lambda$ are two admissible error levels, for false links and false non-links, respectively. Finally, the record pairs are divided into three groups: (1) those with $h \leq h' - 1$ being links, (2) those with $h \geq h'' + 1$ being non-links, and (3) those with configurations between $h'$ and $h''$ requiring clerical review. FellegiSunter69 showed that this decision rule is optimal in the sense that for fixed values of $\mu$ and $\lambda$ it minimizes the probability of sending a pair to clerical review.

We notice that in the presence of missing data the sampling distribution of the comparison vectors changes with each missingness pattern, and therefore so do the thresholds $h'$ and $h''$. The caveats of this decision rule are discussed in Section 3.6.
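The threshold search can be sketched for a small set of comparison patterns as follows. This is an illustration under simplifying assumptions: all probabilities are known and positive, the randomized boundary decisions of the original rule are ignored, and the error levels are small enough that the link and non-link groups do not overlap. The function name and the pattern labels are hypothetical.

```python
import math


def fellegi_sunter_groups(patterns, mu, lam):
    """Split comparison patterns into links, clerical review, and non-links.

    patterns: list of (label, m_prob, u_prob), where m_prob and u_prob are the
    probabilities of the pattern among matches and non-matches.
    mu, lam: admissible error levels for false links and false non-links.
    """
    # Order patterns by their log-likelihood ratio, largest first.
    ordered = sorted(patterns, key=lambda p: math.log(p[1] / p[2]), reverse=True)

    # h': first position whose cumulative non-match probability reaches mu.
    cumulative_u, h_upper = 0.0, len(ordered) + 1
    for h, (_, _, u_prob) in enumerate(ordered, start=1):
        cumulative_u += u_prob
        if cumulative_u >= mu:
            h_upper = h
            break

    # h'': last position whose tail match probability still reaches lam.
    tail_m, h_lower = 0.0, 0
    for h in range(len(ordered), 0, -1):
        tail_m += ordered[h - 1][1]
        if tail_m >= lam:
            h_lower = h
            break

    links = [p[0] for p in ordered[: h_upper - 1]]
    review = [p[0] for p in ordered[h_upper - 1 : h_lower]]
    non_links = [p[0] for p in ordered[h_lower:]]
    return links, review, non_links


# Two binary fields, agreement (A) or disagreement (D), independent within class:
# each field agrees with probability 0.9 among matches and 0.1 among non-matches.
patterns = [("AA", 0.81, 0.01), ("AD", 0.09, 0.09),
            ("DA", 0.09, 0.09), ("DD", 0.01, 0.81)]
links, review, non_links = fellegi_sunter_groups(patterns, mu=0.05, lam=0.05)
```

In this toy setting only full agreement is declared a link and only full disagreement a non-link; the two mixed patterns are sent to clerical review.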

3.4 Enforcing Maximum One-to-One Assignments

The Fellegi-Sunter decision rule does not enforce the maximum one-to-one assignment restriction in bipartite record linkage. For example, if records $i$ and $i'$ in $X_1$ are very similar but are non-coreferent by assumption, and if both are similar to $j \in X_2$, then the Fellegi-Sunter decision rule will probably assign $(i, j)$ and $(i', j)$ as links, which by transitivity would imply a link between $i$ and $i'$ (a contradiction). As a practical solution to this issue, Jaro89 proposed a tweak to the Fellegi-Sunter methodology. The idea is to precede the Fellegi-Sunter decision rule with an optimal assignment of record pairs obtained from a linear sum assignment problem. The problem can be formulated as the maximization:

$$\max_{\Delta} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} w_{ij} \Delta_{ij} \tag{2}$$

subject to

$$\Delta_{ij} \in \{0, 1\}; \qquad \sum_{i=1}^{n_1} \Delta_{ij} \leq 1, \ j = 1, \dots, n_2; \qquad \sum_{j=1}^{n_2} \Delta_{ij} \leq 1, \ i = 1, \dots, n_1,$$

with $w_{ij}$ given by Expression (1), where the first constraint ensures that $\Delta$ represents a discrete structure, and the second and third constraints ensure that each record of $X_2$ is matched with at most one record of $X_1$ and vice versa. This is a maximum-weight bipartite matching problem, or a linear sum assignment problem, for which efficient algorithms exist such as the Hungarian algorithm (see, e.g., PapadimitriouSteiglitz82). The output of this step is a bipartite matching that maximizes the sum of the weights among matched pairs, and the pairs that are not matched by this step are considered non-links. Although Jaro89 did not provide a theoretical justification for this procedure, we now show that this can be thought of as a maximum likelihood estimate (MLE) under certain conditions, in particular under a conditional independence assumption of the comparison vectors which is commonly used in practice, such as in the mixture models presented in Section 3.5.

Proposition 1.

Under the assumption of the comparison vectors being conditionally independent given the bipartite matching, the solution to the linear sum assignment problem in Expression (2) is the MLE of the bipartite matching.

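Proof.

$$\begin{aligned} \hat{\Delta}^{MLE} &= \arg\max_{\Delta} \prod_{i=1}^{n_1} \prod_{j=1}^{n_2} P(\gamma_{ij} \mid \Delta_{ij} = 1)^{\Delta_{ij}} \, P(\gamma_{ij} \mid \Delta_{ij} = 0)^{1 - \Delta_{ij}} \\ &= \arg\max_{\Delta} \prod_{i=1}^{n_1} \prod_{j=1}^{n_2} \left[ \frac{P(\gamma_{ij} \mid \Delta_{ij} = 1)}{P(\gamma_{ij} \mid \Delta_{ij} = 0)} \right]^{\Delta_{ij}} \\ &= \arg\max_{\Delta} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} w_{ij} \Delta_{ij}, \end{aligned}$$

where the maximization is over matching matrices $\Delta$ that represent bipartite matchings, the first line arises under the assumption of the comparison vectors being conditionally independent given the bipartite matching $\Delta$, the second line drops the factor $\prod_{i,j} P(\gamma_{ij} \mid \Delta_{ij} = 0)$, which does not depend on $\Delta$, and the last line arises from applying the natural logarithm. We conclude that $\hat{\Delta}^{MLE}$ is the solution to the linear sum assignment problem in Expression (2). ∎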

When using the MLE $\hat{\Delta}^{MLE}$ there exists the possibility that the matching will include some pairs with a very low matching weight. Jaro89 therefore proposed to apply the Fellegi-Sunter decision rule to the pairs that are matched by $\hat{\Delta}^{MLE}$ to determine which of those can actually be declared to be links.
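To make the assignment step concrete, the sketch below solves a tiny instance of the maximum-weight problem by exhaustive search. This is not the Hungarian algorithm and is only feasible for a handful of records; dedicated solvers (for example `scipy.optimize.linear_sum_assignment`) should be preferred in practice. The weights are made-up values, and the function name `best_matching` is hypothetical.

```python
def best_matching(weights):
    """Exhaustively find the bipartite matching with the largest total weight.

    weights[i][j] is the matching weight of record i (file 1) and record j
    (file 2), both 0-based. Returns (total, z), where z[j] is the record of
    file 1 assigned to record j, or None if record j is left unmatched.
    """
    n1, n2 = len(weights), len(weights[0])
    best = (float("-inf"), None)

    def extend(j, used, partial, total):
        nonlocal best
        if j == n2:
            if total > best[0]:
                best = (total, partial)
            return
        # Option 1: leave record j unmatched (contributes nothing to the sum).
        extend(j + 1, used, partial + [None], total)
        # Option 2: match record j with any record of file 1 not yet used.
        for i in range(n1):
            if i not in used:
                extend(j + 1, used | {i}, partial + [i], total + weights[i][j])

    extend(0, frozenset(), [], 0.0)
    return best


# Records 0 and 1 of file 1 both resemble record 0 of file 2.
weights = [[4.0, -2.0],
           [3.5, -3.0],
           [-5.0, -1.0]]
total, z = best_matching(weights)
```

Thresholding each weight independently at zero would link both of the first two records of file 1 to the same record of file 2; the assignment keeps only the stronger pair and leaves the second record of file 2 unmatched because all of its weights are negative.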

3.5 Model Estimation

The presentation thus far relies on the availability of $P(\cdot \mid \Delta_{ij} = 1)$ and $P(\cdot \mid \Delta_{ij} = 0)$, but these probabilities need to be estimated in practice. In principle these distributions could be estimated from previous correctly linked files, but these are seldom available. As a solution to this problem Winkler88, Jaro89, LarsenRubin01, among others, proposed to model the comparison data using mixture models of the type

$$\Gamma_{ij} \mid \Delta_{ij} = 1 \overset{iid}{\sim} \mathcal{M}(m), \qquad \Gamma_{ij} \mid \Delta_{ij} = 0 \overset{iid}{\sim} \mathcal{U}(u), \qquad \Delta_{ij} \overset{iid}{\sim} \text{Bernoulli}(p), \tag{3}$$

so that the comparison vector $\gamma_{ij}$ is regarded as a realization of a random vector $\Gamma_{ij}$ whose distribution is either $\mathcal{M}(m)$ or $\mathcal{U}(u)$ depending on whether the pair is a match or not, respectively, with $m$ and $u$ representing vectors of parameters and $p$ representing the proportion of matches. The $\mathcal{M}(m)$ and $\mathcal{U}(u)$ models can be products of individual models for each of the comparison components under a conditional independence assumption (Winkler88; Jaro89), or can be more complex log-linear models (LarsenRubin01). The estimation of these models is usually done using the EM algorithm (DempsterLairdRubin77). Notice that the mixture model (3) relies on two key assumptions: the comparison vectors are independent given the bipartite matching, and the matching statuses of the record pairs are independent of one another.
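A compact sketch of the EM iterations for such a two-class mixture is given below, under the conditional independence assumption across fields and with no missing comparisons. The starting values, the function name `em_two_class`, and the toy data are assumptions for illustration; real implementations add convergence checks and safeguards against degenerate solutions.

```python
import math


def em_two_class(vectors, n_levels, n_iter=25):
    """Fit a two-class mixture to comparison vectors with the EM algorithm.

    vectors: list of tuples, one disagreement level per field.
    n_levels: number of levels of each field.
    Returns (p, m, u, trace), where trace holds the observed-data
    log-likelihood evaluated at the start of each iteration.
    """
    n = len(vectors)
    p = 0.1
    # Start with matches favoring low disagreement and non-matches the opposite.
    m = [[(k - l) / (k * (k + 1) / 2) for l in range(k)] for k in n_levels]
    u = [[(l + 1) / (k * (k + 1) / 2) for l in range(k)] for k in n_levels]
    trace = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair belongs to the match class.
        resp, loglik = [], 0.0
        for v in vectors:
            pm = p * math.prod(m[f][l] for f, l in enumerate(v))
            pu = (1 - p) * math.prod(u[f][l] for f, l in enumerate(v))
            resp.append(pm / (pm + pu))
            loglik += math.log(pm + pu)
        trace.append(loglik)
        # M-step: weighted relative frequencies.
        total = sum(resp)
        p = total / n
        for f, k in enumerate(n_levels):
            for l in range(k):
                g_l = sum(g for g, v in zip(resp, vectors) if v[f] == l)
                n_l = sum(1 for v in vectors if v[f] == l)
                m[f][l] = g_l / total
                u[f][l] = (n_l - g_l) / (n - total)
    return p, m, u, trace


vectors = [(0, 0)] * 3 + [(0, 1)] + [(1, 0)] * 2 + [(1, 1)] * 8 + [(1, 2)] * 6
p, m, u, trace = em_two_class(vectors, n_levels=[2, 3])
```

Nothing in these updates ties the fitted classes to actual matches and non-matches, and nothing enforces the one-to-one restriction, which is the point taken up in the next subsection.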

3.6 Caveats and Criticism

Despite the popularity of the previous methodology for record linkage it has a number of weaknesses. In terms of modeling the comparison data as a mixture, there is an implicit “hope” that the clusters that we obtain are closely associated with matches and non-matches. In practice, however, the mixture components may not correspond with these groups of record pairs. In particular, the mixture model will identify two clusters regardless of whether the two files have coreferent records or not. Winkler02 mentions conditions for the mixture model to give good results based on experience working with large administrative files at the US Census Bureau:

  • The proportion of matches should be greater than 5%.

  • The classes of matches and non-matches should be well separated.

  • Typographical error must be relatively low.

  • There must be redundant fields that overcome errors in other fields.

In many practical situations these conditions may not hold, especially when the datafiles contain lots of errors and/or missingness, or when they only have a small number of fields in common. Furthermore, even if the mixture model is successful at roughly separating matches from non-matches, many-to-one matches can still happen if the assignment step proposed by Jaro89 is not used, given that the mixture model is fitted without the one-to-one constraint, in particular assuming independence of the matching statuses of the record pairs. We believe that a more sensible approach is to incorporate this constraint into the model (as in Fortinietal01; Fortinietal02; LiseoTancredi11; TancrediLiseo11; Larsen02; Larsen05; Larsen10; Larsen12; Gutmanetal13) rather than forcing it in a post-processing step.

Finally, we notice that even if the mixture model is fitted with the one-to-one constraint, the Fellegi-Sunter decision rule alone may still lead to many-to-many assignments and chains of links given that it assumes that once we know the distributions $\mathcal{M}(m)$ and $\mathcal{U}(u)$, the comparison data $\gamma_{ij}$ determines the linkage decision. Furthermore, the optimality of the Fellegi-Sunter decision rule heavily relies on this assumption. We have argued, however, that the linkage decision for the pair $(i, j)$ not only depends on $\gamma_{ij}$ but also depends on the linkage decisions for the other pairs $(i', j)$ and $(i, j')$, $i' \neq i$, $j' \neq j$. In Section 5 we propose Bayes estimates that allow a rejection option as an alternative to the Fellegi-Sunter decision rule.

4 A Bayesian Approach to Bipartite Record Linkage

The Bayesian approaches of Fortinietal01 and Larsen02; Larsen05; Larsen10; Larsen12 build on the strengths of the Fellegi-Sunter approach but improve on the mixture model implementation by properly treating the parameter of interest as a bipartite matching, therefore avoiding the inconsistencies coming from treating record pairs’ matching statuses as independent of one another. Here we consider an extension of their modeling approach to handle missing data and to take into account multiple levels of partial agreement. The Bayesian estimation of the bipartite matching (which can be represented by a matching labeling $Z$ or by a matching matrix $\Delta$) has the advantage of providing a posterior distribution that can be used to derive point estimates and to quantify uncertainty about specific parts of the bipartite matching.

4.1 Model for Comparison Data

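Our approach is similar to the mixture model presented in Section 3.5, with the difference that we consider the matching statuses of the record pairs as determined by a bipartite matching:

$$\Gamma_{ij} \mid Z_j = i \overset{iid}{\sim} \mathcal{M}(m), \qquad \Gamma_{ij} \mid Z_j \neq i \overset{iid}{\sim} \mathcal{U}(u), \qquad Z \sim \mathcal{B},$$

where $\mathcal{M}(m)$ and $\mathcal{U}(u)$ are models for the comparison vectors among matches and non-matches, as explained in Section 3.5, and $\mathcal{B}$ represents a prior on the space of bipartite matchings, such as the one presented in Section 4.3.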

4.2 Conditional Independence and Missing Comparisons

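In this section we provide a simple parametrization for the models $\mathcal{M}(m)$ and $\mathcal{U}(u)$ that allows standard prior specification and makes it straightforward to deal with missing comparisons. Under the assumption of the comparison fields being conditionally independent (CI) given the matching statuses of the record pairs we obtain that the likelihood of the comparison data can be written as

$$L(Z, \Phi \mid \gamma) = \prod_{i=1}^{n_1} \prod_{j=1}^{n_2} \prod_{f=1}^{F} \prod_{l=0}^{L_f} \left[ m_{fl}^{I(Z_j = i)} \, u_{fl}^{I(Z_j \neq i)} \right]^{I(\gamma_{ij}^f = l)}, \tag{4}$$

where $m_{fl} = P(\Gamma_{ij}^f = l \mid Z_j = i)$ denotes the probability of a match having level $l$ of disagreement in field $f$, and $u_{fl} = P(\Gamma_{ij}^f = l \mid Z_j \neq i)$ represents the analogous probability for non-matches. We denote $m_f = (m_{f0}, \dots, m_{fL_f})$, $u_f = (u_{f0}, \dots, u_{fL_f})$, $m = (m_1, \dots, m_F)$, $u = (u_1, \dots, u_F)$, and $\Phi = (m, u)$. This model is an extension of the one considered by Larsen02; Larsen05; Larsen10; Larsen12, which in turn is a parsimonious simplification of the one in Fortinietal01, who only considered binary comparisons.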

We now need to modify this model to accommodate missing comparison criteria since in practice it is rather common to find records with missing fields of information, which lead in turn to missing comparisons for the corresponding record pairs. For example, if a certain field that is being used to compute comparison data is missing for record $i$, then the vector $\gamma_{ij}$ will be incomplete, regardless of whether the field is missing for record $j$.

A simple way to deal with this situation is to assume that the missing comparisons occur at random (MAR assumption in LittleRubin02), and therefore we can base our inferences on the marginal distribution of the observed comparisons (LittleRubin02, p. 90). Under the parametrization of Equation (4) and the MAR assumption, after marginalizing over the missing comparisons the likelihood of the observed comparison data $\gamma^{obs}$ can be written as

$$L(Z, \Phi \mid \gamma^{obs}) = \prod_{f=1}^{F} \prod_{l=0}^{L_f} m_{fl}^{a_{fl}(Z)} \, u_{fl}^{b_{fl}(Z)}, \tag{5}$$

with

$$a_{fl}(Z) = \sum_{i,j} I_{obs}(\gamma_{ij}^f) \, I(\gamma_{ij}^f = l) \, I(Z_j = i), \qquad b_{fl}(Z) = \sum_{i,j} I_{obs}(\gamma_{ij}^f) \, I(\gamma_{ij}^f = l) \, I(Z_j \neq i), \tag{6}$$

where $I_{obs}(\cdot)$ is the indicator of whether its argument is observed. For a given matching labeling $Z$, $a_{fl}(Z)$ and $b_{fl}(Z)$ represent the number of matches and non-matches with observed disagreement level $l$ in comparison $f$. From Equations (5) and (6) we can see that the combination of the CI and MAR assumptions allows us to ignore the comparisons that are not observed while modeling the observed comparisons in a simple fashion.
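The counts of observed disagreement levels among matches and non-matches are the only summaries of the comparison data that the likelihood needs. A minimal sketch of their computation follows, assuming the comparison data are stored as a nested list indexed by record of file 1, record of file 2, and field, with `None` marking a missing comparison; the function name `agreement_counts` is hypothetical.

```python
def agreement_counts(gamma, z, n_levels):
    """Count observed disagreement levels among matches and non-matches.

    gamma[i][j][f]: disagreement level of field f for records i and j
                    (0-based indices), or None if the comparison is missing.
    z: matching labeling with 1-based labels (z[j] == i + 1 means a match).
    n_levels[f]: number of disagreement levels of field f.
    Returns (a, b) with a[f][l] counting matches and b[f][l] non-matches.
    """
    a = [[0] * k for k in n_levels]
    b = [[0] * k for k in n_levels]
    for i, row in enumerate(gamma):
        for j, comparison in enumerate(row):
            is_match = z[j] == i + 1
            for f, level in enumerate(comparison):
                if level is None:
                    continue  # missing comparisons are simply skipped
                if is_match:
                    a[f][level] += 1
                else:
                    b[f][level] += 1
    return a, b


# Two records per file, two fields with 2 and 3 levels; one comparison is missing.
gamma = [[(0, 0), (1, 2)],
         [(1, None), (0, 1)]]
z = [1, 4]  # record 1 of file 2 matches record 1 of file 1; record 2 is unmatched
a, b = agreement_counts(gamma, z, n_levels=[2, 3])
```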

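Under the previous parametrization it is easy to use the independent conjugate priors $m_f \sim \text{Dirichlet}(\alpha_{f0}, \dots, \alpha_{fL_f})$ and $u_f \sim \text{Dirichlet}(\beta_{f0}, \dots, \beta_{fL_f})$ for $f = 1, \dots, F$.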

4.3 Beta Prior for Bipartite Matchings

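We now construct a prior distribution for matching labelings $Z = (Z_1, \dots, Z_{n_2})$ where $Z_j \in \{1, 2, \dots, n_1, n_1 + j\}$ and $Z_j \neq Z_{j'}$ whenever $j \neq j'$. We start by sampling the indicators of which records in file $X_2$ have a match. Let $I(Z_j \leq n_1) \overset{iid}{\sim} \text{Bernoulli}(\pi)$, $j = 1, \dots, n_2$, where $\pi$ represents the proportion of matches expected a priori as a fraction of the smallest file $X_2$. We take $\pi$ to be distributed according to a Beta$(\alpha_\pi, \beta_\pi)$ a priori. In this formulation $n_{12}(Z) = \sum_{j=1}^{n_2} I(Z_j \leq n_1)$ represents the number of matches according to matching labeling $Z$, and it is distributed according to a Beta-Binomial$(n_2, \alpha_\pi, \beta_\pi)$, after marginalizing over $\pi$. Conditioning on knowing which records in file $X_2$ have a match, that is, conditioning on $\{I(Z_j \leq n_1)\}_{j=1}^{n_2}$, all the possible bipartite matchings are taken to be equally likely. There are $n_1! / (n_1 - n_{12}(Z))!$ such bipartite matchings. Finally, the probability mass function for $Z$ is given by

$$P(Z \mid \alpha_\pi, \beta_\pi) = \frac{(n_1 - n_{12}(Z))!}{n_1!} \, \frac{B(n_{12}(Z) + \alpha_\pi, \, n_2 - n_{12}(Z) + \beta_\pi)}{B(\alpha_\pi, \beta_\pi)},$$

where B represents the Beta function. We shall refer to this distribution as the beta distribution for bipartite matchings. Notice that in this formulation the hyper-parameters $\alpha_\pi$ and $\beta_\pi$ can be used to incorporate prior information on the amount of overlap between the files. This prior was first proposed by Fortinietal01; Fortinietal02 (with fixed $\pi$) and Larsen05; Larsen10 in terms of matching matrices.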
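The probability mass function of this prior is easy to evaluate on the log scale. The sketch below does so and, as a sanity check, enumerates every matching labeling for a tiny pair of files to confirm that the probabilities add up to one. Labels are 1-based, an unmatched record j of the smaller file carries the label n1 + j, and the function names are hypothetical.

```python
import math
from itertools import product


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_prior(z, n1, alpha, beta):
    """Log-probability of matching labeling z under the beta prior."""
    n2 = len(z)
    n12 = sum(1 for label in z if label <= n1)  # number of matched records
    return (math.lgamma(n1 - n12 + 1) - math.lgamma(n1 + 1)
            + log_beta(n12 + alpha, n2 - n12 + beta) - log_beta(alpha, beta))


def all_labelings(n1, n2):
    """Enumerate every valid matching labeling (only feasible for tiny files)."""
    choices = [list(range(1, n1 + 1)) + [n1 + j] for j in range(1, n2 + 1)]
    for z in product(*choices):
        matched = [label for label in z if label <= n1]
        if len(matched) == len(set(matched)):  # no record of file 1 used twice
            yield z


labelings = list(all_labelings(n1=3, n2=2))
total = sum(math.exp(log_prior(z, 3, alpha=1.0, beta=1.0)) for z in labelings)
```

With both hyper-parameters equal to one, each possible number of matches (zero, one, or two in this example) receives the same total prior probability, which is then split evenly among the matchings with that number of matches.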

4.4 Gibbs Sampler

We now present a Gibbs sampler to explore the joint posterior of $Z$ and $\Phi$ given the observed comparison data $\gamma^{obs}$, for the likelihood and priors presented before. Although it is easy to marginalize over $\Phi$ and derive a collapsed Gibbs sampler that iterates only over $Z$, we present the expanded version to show some connections with the Fellegi-Sunter approach.

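We start the Gibbs sampler with an empty bipartite matching, that is $Z_j^{[0]} = n_1 + j$ for all $j$. For a current value of the matching labeling $Z^{[t]}$, we obtain the next values $m_f^{[t+1]}$, $u_f^{[t+1]}$, for $f = 1, \dots, F$, and $Z^{[t+1]}$ as follows:

  1. For $f = 1, \dots, F$, sample

    $$m_f^{[t+1]} \mid \gamma^{obs}, Z^{[t]} \sim \text{Dirichlet}(a_{f0}(Z^{[t]}) + \alpha_{f0}, \dots, a_{fL_f}(Z^{[t]}) + \alpha_{fL_f})$$

    and

    $$u_f^{[t+1]} \mid \gamma^{obs}, Z^{[t]} \sim \text{Dirichlet}(b_{f0}(Z^{[t]}) + \beta_{f0}, \dots, b_{fL_f}(Z^{[t]}) + \beta_{fL_f}).$$

    Collect these new draws into $\Phi^{[t+1]}$. The functions $a_{fl}(\cdot)$ and $b_{fl}(\cdot)$ are presented in Equation (6).

  2. Sample the entries of $Z^{[t+1]}$ sequentially. Having sampled the first $j - 1$ entries of $Z^{[t+1]}$, we define $Z_{-j}^{[t + (j-1)/n_2]} = (Z_1^{[t+1]}, \dots, Z_{j-1}^{[t+1]}, Z_{j+1}^{[t]}, \dots, Z_{n_2}^{[t]})$, and sample a new label $Z_j^{[t+1]}$, with the probability of selecting label $q \in \{1, \dots, n_1, n_1 + j\}$ given by $p_{qj}(Z_{-j}^{[t + (j-1)/n_2]} \mid \Phi^{[t+1]})$, which can be expressed as (for generic $Z_{-j}$ and $\Phi$):

    $$p_{qj}(Z_{-j} \mid \Phi) \propto \begin{cases} \exp[w_{qj}] \, I(Z_{j'} \neq q, \ \forall j' \neq j), & \text{if } q \leq n_1; \\ [n_1 - n_{12}(Z_{-j})] \, \dfrac{n_2 - n_{12}(Z_{-j}) - 1 + \beta_\pi}{n_{12}(Z_{-j}) + \alpha_\pi}, & \text{if } q = n_1 + j; \end{cases} \tag{7}$$

    where $n_{12}(Z_{-j})$ is the number of matches among the entries of $Z_{-j}$, and $w_{qj} = \log[P(\gamma_{qj}^{obs} \mid m, Z_j = q) / P(\gamma_{qj}^{obs} \mid u, Z_j \neq q)]$ can be expressed as

    $$w_{qj} = \sum_{f=1}^{F} I_{obs}(\gamma_{qj}^f) \sum_{l=0}^{L_f} \log\left(\frac{m_{fl}}{u_{fl}}\right) I(\gamma_{qj}^f = l) \tag{8}$$

    for $q \leq n_1$.

From Equations (7) and (8) we can see that at a certain step of the Gibbs sampler the assignment of a record $i$ in file $X_1$ as a match of record $j$ will depend on the weight $w_{ij}$, as long as record $i$ does not match any other record of file $X_2$ according to $Z_{-j}$. These are essentially the same weights used in the Fellegi-Sunter approach to record linkage (Section 3). In particular, if there are no missing comparisons, Equation (8) represents the composite weight proposed by Winkler90Strings to account for partial agreements. Equation (7) also indicates that the probability of not matching record $j$ with any record in file $X_1$ depends on the number of unmatched records in file $X_1$, $n_1 - n_{12}(Z_{-j})$, and the odds of a non-match in file $X_2$, $(n_2 - n_{12}(Z_{-j}) - 1 + \beta_\pi) / (n_{12}(Z_{-j}) + \alpha_\pi)$, according to $Z_{-j}$. The lower the number of current matches, the larger the probability of not matching record $j$.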
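The update of a single label can be sketched as follows. The code computes the composite weight of a record pair from its observed comparisons and then the normalized probabilities of every admissible label for one record of the second file, given the labels of the remaining records and the beta prior for bipartite matchings. Data layout (nested lists, `None` for missing comparisons, 1-based labels) and function names are assumptions for this illustration.

```python
import math


def pair_weight(comparison, m, u):
    """Sum of log(m/u) over the observed fields of one record pair."""
    return sum(math.log(m[f][l] / u[f][l])
               for f, l in enumerate(comparison) if l is not None)


def full_conditional(j, z, gamma, m, u, alpha, beta):
    """Probabilities of each admissible label for record j (1-based) of file 2.

    z: current matching labeling; entry j - 1 is ignored.
    gamma[i][k]: comparison vector of record i + 1 (file 1) and k + 1 (file 2).
    alpha, beta: hyper-parameters of the beta prior for bipartite matchings.
    """
    n1, n2 = len(gamma), len(z)
    # Records of file 1 already matched to other records of file 2.
    taken = {z[k] for k in range(n2) if k != j - 1 and z[k] <= n1}
    n12 = len(taken)
    scores = {}
    for q in range(1, n1 + 1):
        if q not in taken:
            scores[q] = math.exp(pair_weight(gamma[q - 1][j - 1], m, u))
    # Label n1 + j: leave record j unmatched.
    scores[n1 + j] = (n1 - n12) * (n2 - n12 - 1 + beta) / (n12 + alpha)
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


m = [[0.9, 0.1]]  # one binary field: matches mostly agree (level 0)
u = [[0.2, 0.8]]  # non-matches mostly disagree (level 1)
gamma = [[(0,), (1,)],
         [(1,), (None,)]]
empty = full_conditional(1, [3, 4], gamma, m, u, alpha=1.0, beta=1.0)
blocked = full_conditional(1, [3, 1], gamma, m, u, alpha=1.0, beta=1.0)
```

A new label can then be drawn with, for example, `random.choices(list(probs), weights=list(probs.values()))`. In the second call record 1 of file 1 is already taken by the other record of file 2, so it does not appear among the admissible labels.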

When using a flat prior on the space of bipartite matchings we obtain an expression similar to Equation (7), but with a probability of leaving record $j$ unmatched proportional to 1, indicating that under that prior the odds of a match do not take into account the number of existing matches. In practice this translates into larger numbers of false-matches under the flat prior for scenarios where the actual overlap of the datafiles is small, given that the evidence for a match does not have to be as strong as when using a beta prior for bipartite matchings. The point estimator presented in Section 3.4 suffers a similar phenomenon, as we show in Section 6.

5 Bayes Estimates of Bipartite Matchings

From a Bayesian theoretical point of view (e.g., Berger85; BernardoSmith94) we can obtain different point estimates $\hat{Z}$ of the bipartite matching using the posterior distribution of $Z$ and different loss functions $L(Z, \hat{Z})$. The Bayes estimate for $Z$ is the bipartite matching $\hat{Z}$ that minimizes the posterior expected loss $E[L(Z, \hat{Z}) \mid \gamma^{obs}] = \sum_{Z} L(Z, \hat{Z}) P(Z \mid \gamma^{obs})$. In this section we present a class of additive loss functions that can be used to derive different Bayes estimates.

In some scenarios some records may have a large matching uncertainty, and therefore a point estimate for the whole bipartite matching may not be appropriate. In Figure 2 we present a toy example where a record in one file has three possible matches in the other file, making it difficult to make a reliable decision. The approach presented below allows the possibility of leaving uncertain parts of the bipartite matching unresolved. Decision rules in the classification literature akin to the ones presented here are said to have a rejection option (see, e.g., Ripley96; Hu14). The rejection option in our context refers to the possibility of not making a linkage decision for a certain record. These unresolved cases can, for example, be hand-matched as part of a clerical review. We refer to point estimates with a rejection option as partial estimates, as opposed to full estimates which assign a linkage decision to each record.

Figure 2: Toy example of uncertain matching. DOB: Date of birth.

We work in terms of matching labelings $\mathbf{Z}$ instead of matching matrices $\Delta$, which means that we target questions of the type “which record in file $\mathbf{X}_1$ (if any) should be matched with record $j$ in file $\mathbf{X}_2$?” rather than “do records $i$ and $j$ match?” Working with $\mathbf{Z}$ makes it explicit that in bipartite record linkage there are $n_2$ linkage decisions to be made rather than $n_1 \times n_2$.

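We represent a Bayes estimate here as a vector $\hat{\mathbf{Z}} = (\hat{Z}_1, \dots, \hat{Z}_{n_2})$, where $\hat{Z}_j \in \{1, \dots, n_1, n_1+j, R\}$, with $R$ representing the rejection option. We propose to assign different positive losses to different types of errors and compute the overall loss additively, as

$$L(\mathbf{Z},\hat{\mathbf{Z}}) = \sum_{j=1}^{n_2} L(Z_j,\hat{Z}_j), \tag{9}$$

with

$$L(Z_j,\hat{Z}_j) = \begin{cases} 0, & \text{if } Z_j = \hat{Z}_j;\\ \lambda_{R}, & \text{if } \hat{Z}_j = R;\\ \lambda_{\mathrm{FNM}}, & \text{if } Z_j \le n_1,\ \hat{Z}_j = n_1+j;\\ \lambda_{\mathrm{FM1}}, & \text{if } Z_j = n_1+j,\ \hat{Z}_j \le n_1;\\ \lambda_{\mathrm{FM2}}, & \text{if } Z_j, \hat{Z}_j \le n_1,\ Z_j \ne \hat{Z}_j; \end{cases} \tag{10}$$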

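that is, $\lambda_{R}$ represents the loss from not taking a decision (rejection), $\lambda_{\mathrm{FNM}}$ is the loss from a false non-match decision, $\lambda_{\mathrm{FM1}}$ is the loss from a false match decision when the record does not actually match any other record, and $\lambda_{\mathrm{FM2}}$ is the loss from a false match decision when the record actually matches a different record than the one assigned to it. The posterior expected loss is given by

$$\mathbb{E}[L(\mathbf{Z},\hat{\mathbf{Z}}) \mid \gamma^{obs}] = \sum_{j=1}^{n_2} \mathbb{E}[L(Z_j,\hat{Z}_j) \mid \gamma^{obs}],$$

where

$$\mathbb{E}[L(Z_j,\hat{Z}_j) \mid \gamma^{obs}] = \begin{cases} \lambda_{R}, & \text{if } \hat{Z}_j = R;\\ \lambda_{\mathrm{FNM}}\, P(Z_j \ne n_1+j \mid \gamma^{obs}), & \text{if } \hat{Z}_j = n_1+j;\\ \lambda_{\mathrm{FM1}}\, P(Z_j = n_1+j \mid \gamma^{obs}) + \lambda_{\mathrm{FM2}}\, P(Z_j \notin \{\hat{Z}_j, n_1+j\} \mid \gamma^{obs}), & \text{if } \hat{Z}_j \le n_1. \end{cases} \tag{11}$$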

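The Bayes estimate can be obtained, in general, by solving (minimizing) a linear sum assignment problem with a $(n_1 + 2n_2) \times n_2$ matrix of weights with entries

$$c_{ij} = \begin{cases} \lambda_{\mathrm{FM1}}\, P(Z_j = n_1+j \mid \gamma^{obs}) + \lambda_{\mathrm{FM2}}\, P(Z_j \notin \{i, n_1+j\} \mid \gamma^{obs}), & \text{if } i \le n_1;\\ \lambda_{\mathrm{FNM}}\, P(Z_j \ne n_1+j \mid \gamma^{obs}), & \text{if } i = n_1+j;\\ \lambda_{R}, & \text{if } i = n_1+n_2+j;\\ \infty, & \text{otherwise.} \end{cases}$$

In this matrix the first $n_1$ rows accommodate the possibility of records in file $\mathbf{X}_2$ linking to any record in file $\mathbf{X}_1$, the next $n_2$ rows accommodate the possibility of records in $\mathbf{X}_2$ not linking to any record in $\mathbf{X}_1$, and the last $n_2$ rows represent the possibility of not taking linkage decisions (rejections) for the records in $\mathbf{X}_2$. Rather than working with this general formulation we now focus on some important particular cases that lead to simple derivations of the Bayes estimates.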
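The general formulation can be illustrated with a minimal sketch. It assumes the weights are the posterior expected losses of Expression (11), arranged in the three row blocks described above, and that the posterior marginals are supplied as a hypothetical nested list `post`, where `post[j][i]` approximates the posterior probability that record j+1 in the second file matches record i+1 in the first file (the remaining mass of each row is the non-match probability). The assignment problem is solved here by brute-force enumeration, which is only feasible for toy inputs; a dedicated linear sum assignment solver would be used in practice. All function names are illustrative.

```python
from itertools import permutations

BIG = 1e9  # finite stand-in for an infinite weight (forbidden cell)

def weight_matrix(post, lam_r, lam_fnm, lam_fm1, lam_fm2):
    """Build the (n1 + 2*n2) x n2 matrix of posterior expected losses."""
    n2, n1 = len(post), len(post[0])
    w = [[BIG] * n2 for _ in range(n1 + 2 * n2)]
    for j, pj in enumerate(post):
        p_none = 1.0 - sum(pj)  # posterior probability of no match
        for i in range(n1):     # rows 0..n1-1: link j to record i
            w[i][j] = lam_fm1 * p_none + lam_fm2 * (1.0 - pj[i] - p_none)
        w[n1 + j][j] = lam_fnm * (1.0 - p_none)  # non-match row for j
        w[n1 + n2 + j][j] = lam_r                # rejection row for j
    return w

def solve_by_enumeration(w):
    """Pick one distinct row per column with minimal total weight."""
    rows, n2 = len(w), len(w[0])
    return list(min(permutations(range(rows), n2),
                    key=lambda a: sum(w[r][j] for j, r in enumerate(a))))

def decode(assign, n1, n2):
    """Translate row indices into labels: i, n1 + j, or 'R'."""
    labels = []
    for j, r in enumerate(assign, start=1):
        if r < n1:
            labels.append(r + 1)
        elif r < n1 + n2:
            labels.append(n1 + j)
        else:
            labels.append("R")
    return labels

# Toy posterior marginals (made up): both records favor record 1.
post = [[0.45, 0.05], [0.40, 0.10]]
w = weight_matrix(post, lam_r=10.0, lam_fnm=3.0, lam_fm1=1.0, lam_fm2=1.0)
labels = decode(solve_by_enumeration(w), n1=2, n2=2)
```

With these made-up losses, minimizing each column separately would link both records to record 1; the assignment constraint resolves the conflict and yields the labels `[1, 2]`.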

5.1 Closed-Form Full Estimates with Conservative Link Assignments

We first consider the case where we are required to output decisions for all records. In this case fixing $\lambda_{R} = \infty$ prevents outputting rejections. Letting $\lambda_{\mathrm{FNM}} \le \lambda_{\mathrm{FM1}}$ represents the idea that the loss from a false non-match is not higher than the possible losses from a false match. Furthermore, the error of matching $j$ with a record $i$ when it actually matches another record $i'$ implies that record $i'$ will not be matched correctly either, and therefore it is reasonable to take $\lambda_{\mathrm{FM2}}$ to be much larger than the other losses. In particular we work with $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} + \lambda_{\mathrm{FNM}}$.

Theorem 1.

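If $\lambda_{R} = \infty$, $\lambda_{\mathrm{FNM}} \le \lambda_{\mathrm{FM1}}$, and $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} + \lambda_{\mathrm{FNM}}$ in the loss function given by Equations (9) and (10), the Bayes estimate of the bipartite matching is obtained from $\hat{\mathbf{Z}} = (\hat{Z}_1, \dots, \hat{Z}_{n_2})$, where $\hat{Z}_j$ is given by

$$\hat{Z}_j = \begin{cases} i, & \text{if } P(Z_j = i \mid \gamma^{obs}) > \dfrac{\lambda_{\mathrm{FM1}}}{\lambda_{\mathrm{FM1}}+\lambda_{\mathrm{FNM}}} + \dfrac{\lambda_{\mathrm{FM2}}-\lambda_{\mathrm{FM1}}-\lambda_{\mathrm{FNM}}}{\lambda_{\mathrm{FM1}}+\lambda_{\mathrm{FNM}}}\, P(Z_j \notin \{i, n_1+j\} \mid \gamma^{obs});\\ n_1+j, & \text{otherwise.} \end{cases}$$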

Proof.

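The strategy for the proof is to obtain the optimal marginal value of each $\hat{Z}_j$ by minimizing each term $\mathbb{E}[L(Z_j,\hat{Z}_j) \mid \gamma^{obs}]$ shown in Equation (11). If this approach leads to a proper bipartite matching then it corresponds to the optimal solution of the problem, given that the constraints $\hat{Z}_j \ne \hat{Z}_{j'}$ for $j \ne j'$ would hold.

To find the optimal value of $\hat{Z}_j$ we can start by finding the optimal label among $\{1, \dots, n_1\}$. It is easy to see that $i$ minimizes $\mathbb{E}[L(Z_j, i) \mid \gamma^{obs}]$ over $\{1, \dots, n_1\}$ if and only if it maximizes $P(Z_j = i \mid \gamma^{obs})$. Now, if $i$ is the best possible match for $j$, the decision of matching $j$ with $i$ over not matching $j$ with any record depends on whether $\mathbb{E}[L(Z_j, i) \mid \gamma^{obs}] < \mathbb{E}[L(Z_j, n_1+j) \mid \gamma^{obs}]$, which can easily be checked to be equivalent to the inequality stated in the theorem.

Given that this solution was obtained ignoring the constraints that require $\hat{Z}_j \ne \hat{Z}_{j'}$ for $j \ne j'$, we need to make sure that it leads to a bipartite matching. Indeed, given the conditions $\lambda_{\mathrm{FNM}} \le \lambda_{\mathrm{FM1}}$ and $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} + \lambda_{\mathrm{FNM}}$ we have $\lambda_{\mathrm{FM1}}/(\lambda_{\mathrm{FM1}}+\lambda_{\mathrm{FNM}}) \ge 1/2$ and $\lambda_{\mathrm{FM2}}-\lambda_{\mathrm{FM1}}-\lambda_{\mathrm{FNM}} \ge 0$, which imply that under this solution $\hat{Z}_j = i$ only if $P(Z_j = i \mid \gamma^{obs}) > 1/2$. Since we are working with a posterior distribution on bipartite matchings we necessarily have that $\sum_{j=1}^{n_2} P(Z_j = i \mid \gamma^{obs}) \le 1$, given that the events $\{Z_j = i\}$, $j = 1, \dots, n_2$, are disjoint. This implies that $P(Z_{j'} = i \mid \gamma^{obs}) < 1/2$ for all $j' \ne j$, and so $\hat{Z}_{j'} \ne i$ for all $j' \ne j$. We conclude that the solution given by the theorem satisfies the constrained problem. ∎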
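The equivalence invoked in the proof can be checked in a few lines. The worked step below is a sketch that assumes the expected losses implied by the loss in Equations (9) and (10), and introduces the shorthand $p_i = P(Z_j = i \mid \gamma^{obs})$, $p_0 = P(Z_j = n_1+j \mid \gamma^{obs})$, and $q = 1 - p_i - p_0$ (the probability that $j$ matches a record other than $i$); this shorthand is ours and not part of the original notation.

```latex
\begin{align*}
\lambda_{\mathrm{FM1}}\,p_0 + \lambda_{\mathrm{FM2}}\,q
  &< \lambda_{\mathrm{FNM}}\,(1-p_0)
  && \text{(matching $j$ with $i$ has the smaller expected loss)}\\
\lambda_{\mathrm{FM1}}\,(1-p_i-q) + \lambda_{\mathrm{FM2}}\,q
  &< \lambda_{\mathrm{FNM}}\,(p_i+q)
  && \text{(substitute $p_0 = 1-p_i-q$)}\\
(\lambda_{\mathrm{FM1}}+\lambda_{\mathrm{FNM}})\,p_i
  &> \lambda_{\mathrm{FM1}}
   + (\lambda_{\mathrm{FM2}}-\lambda_{\mathrm{FM1}}-\lambda_{\mathrm{FNM}})\,q
  && \text{(collect the terms in $p_i$)}\\
p_i
  &> \frac{\lambda_{\mathrm{FM1}}}{\lambda_{\mathrm{FM1}}+\lambda_{\mathrm{FNM}}}
   + \frac{\lambda_{\mathrm{FM2}}-\lambda_{\mathrm{FM1}}-\lambda_{\mathrm{FNM}}}
          {\lambda_{\mathrm{FM1}}+\lambda_{\mathrm{FNM}}}\,q .
\end{align*}
```

Setting $\lambda_{\mathrm{FNM}} = \lambda_{\mathrm{FM1}} = 1$ and $\lambda_{\mathrm{FM2}} = 2$ makes the second term vanish and reduces the threshold to $1/2$.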

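The conservative nature of the Bayes estimates obtained from Theorem 1 is evident from the fact that to declare a match between records $i$ and $j$ we require $P(Z_j = i \mid \gamma^{obs})$ to be at least $\lambda_{\mathrm{FM1}}/(\lambda_{\mathrm{FM1}}+\lambda_{\mathrm{FNM}}) \ge 1/2$. Furthermore, in cases where record $j$ has a non-zero probability of matching other records besides $i$, that is, when $P(Z_j \notin \{i, n_1+j\} \mid \gamma^{obs}) > 0$, if $\lambda_{\mathrm{FM2}} > \lambda_{\mathrm{FM1}} + \lambda_{\mathrm{FNM}}$ the decision rule in Theorem 1 is extra conservative, increasing the threshold for declaring matches.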
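The effect of the threshold can be seen in a short sketch. It assumes that the rule of Theorem 1 compares the largest posterior match probability of each record against the loss-dependent threshold discussed above, and that the marginals are supplied as a hypothetical nested list `post`, where `post[j][i]` approximates $P(Z_{j+1} = i+1 \mid \gamma^{obs})$ and the remaining mass of each row is the non-match probability. The function name and the numbers are illustrative only.

```python
def full_estimate(post, lam_fnm, lam_fm1, lam_fm2):
    """Closed-form full estimate under the conditions of Theorem 1 (sketch).

    Returns labels in the paper's convention: i in 1..n1 for a link,
    n1 + j for a non-match of record j (1-based).
    """
    if not (0 < lam_fnm <= lam_fm1 and lam_fm2 >= lam_fm1 + lam_fnm):
        raise ValueError("losses violate the conditions assumed here")
    n1 = len(post[0])
    estimate = []
    for j, row in enumerate(post, start=1):
        best = max(range(n1), key=lambda i: row[i])  # most probable candidate
        p_best = row[best]
        p_none = 1.0 - sum(row)             # probability of no match
        p_other = 1.0 - p_best - p_none     # probability of another match
        threshold = (lam_fm1 + (lam_fm2 - lam_fm1 - lam_fnm) * p_other) \
            / (lam_fm1 + lam_fnm)
        estimate.append(best + 1 if p_best > threshold else n1 + j)
    return estimate

# Made-up marginals for n1 = 3 and n2 = 2 records.
post = [[0.90, 0.05, 0.00],
        [0.05, 0.55, 0.30]]

zero_one = full_estimate(post, lam_fnm=1, lam_fm1=1, lam_fm2=2)
conservative = full_estimate(post, lam_fnm=1, lam_fm1=1, lam_fm2=4)
```

In this example the second record is linked when $\lambda_{\mathrm{FM2}} = 2$ (threshold $1/2$) but declared a non-match when $\lambda_{\mathrm{FM2}} = 4$, because its competing candidates raise the threshold above its best match probability.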

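The Bayes estimate of Theorem 1 has an important particular case. TancrediLiseo11 derived a decision rule using the entrywise zero-one loss for matching matrices

$$L(\Delta,\hat{\Delta}) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} I(\Delta_{ij} \ne \hat{\Delta}_{ij}),$$

which is equivalent to our additive loss function when $\lambda_{R} = \infty$, $\lambda_{\mathrm{FNM}} = \lambda_{\mathrm{FM1}} = 1$, and $\lambda_{\mathrm{FM2}} = 2$ in Equations (9) and (10), and therefore we obtain the following corollary.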

Corollary 1.1 (TancrediLiseo11).

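If $\lambda_{R} = \infty$, $\lambda_{\mathrm{FNM}} = \lambda_{\mathrm{FM1}} = 1$, and $\lambda_{\mathrm{FM2}} = 2$ in the loss function given by Equations (9) and (10), the Bayes estimate of the bipartite matching is obtained from $\hat{\mathbf{Z}} = (\hat{Z}_1, \dots, \hat{Z}_{n_2})$, where $\hat{Z}_j$ is given by

$$\hat{Z}_j = \begin{cases} i, & \text{if } P(Z_j = i \mid \gamma^{obs}) > 1/2;\\ n_1+j, & \text{otherwise.} \end{cases} \tag{12}$$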

5.2 Closed-Form Partial Estimates

To emphasize the importance of the rejection option, let us refer to the toy example of Figure 2, where a record $j$ in file $\mathbf{X}_2$ has three possible matches $i$, $i'$, and $i''$ in file $\mathbf{X}_1$. If each of these matches is equally likely we necessarily have that $P(Z_j = i \mid \gamma^{obs}) \le 1/3$, and likewise for $i'$ and $i''$. In this case the optimal decision under the entrywise zero-one loss is to not match $j$ with any record in file $\mathbf{X}_1$. On the other hand, in the case of the bipartite matching MLE (Section 3.4), if the three weights $w_{ij}$, $w_{i'j}$, $w_{i''j}$ are equal and positive then this estimate will arbitrarily match $j$ with one of $i$, $i'$, or $i''$ (if no other records are involved). This scenario illustrates the advantage of using a decision rule that allows us to leave uncertain parts of the bipartite matching unresolved.

We now present a particular case of our additive loss function that allows us to output rejections and leads to closed-form Bayes estimates. At the end of this section we explain why the constraints that we consider on the individual losses are meaningful in practice.

Theorem 2.

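If either 1) $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} \ge 2\lambda_{R} > 0$, or 2) $\lambda_{\mathrm{FM1}} \ge \lambda_{\mathrm{FNM}} > 0$ and $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} + \lambda_{\mathrm{FNM}}$, in the loss function given by Equations (9) and (10), the Bayes estimate of the bipartite matching can be obtained from $\hat{\mathbf{Z}} = (\hat{Z}_1, \dots, \hat{Z}_{n_2})$, with $\hat{Z}_j = \arg\min_{z_j} \mathbb{E}[L(Z_j, z_j) \mid \gamma^{obs}]$, where $\mathbb{E}[L(Z_j, z_j) \mid \gamma^{obs}]$ is given by Expression (11).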

Proof.

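We need to show that $\hat{\mathbf{Z}}$ is such that if $\hat{Z}_j = i \le n_1$ then $\hat{Z}_{j'} \ne i$ for $j' \ne j$, that is, we do not obtain conflicting matching decisions.

  1. Assume $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} \ge 2\lambda_{R} > 0$. According to the construction of $\hat{Z}_j$ in the theorem, if $\hat{Z}_j = i$ then $\mathbb{E}[L(Z_j, i) \mid \gamma^{obs}] < \lambda_{R}$, which is equivalent to

    $$\lambda_{\mathrm{FM1}}\, P(Z_j = n_1+j \mid \gamma^{obs}) + \lambda_{\mathrm{FM2}}\, P(Z_j \notin \{i, n_1+j\} \mid \gamma^{obs}) < \lambda_{R}.$$

    Using this inequality along with the restrictions $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} \ge 2\lambda_{R}$ we obtain $\lambda_{\mathrm{FM1}}\,[1 - P(Z_j = i \mid \gamma^{obs})] < \lambda_{R} \le \lambda_{\mathrm{FM1}}/2$, that is, $P(Z_j = i \mid \gamma^{obs}) > 1/2$, which implies that $P(Z_{j'} = i \mid \gamma^{obs}) < 1/2$ because $\sum_{j=1}^{n_2} P(Z_j = i \mid \gamma^{obs}) \le 1$, and therefore $\hat{Z}_{j'} \ne i$ for all $j' \ne j$.

  2. Assume $\lambda_{\mathrm{FM1}} \ge \lambda_{\mathrm{FNM}} > 0$ and $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} + \lambda_{\mathrm{FNM}}$. If $\hat{Z}_j = i$ then $\mathbb{E}[L(Z_j, i) \mid \gamma^{obs}] < \mathbb{E}[L(Z_j, n_1+j) \mid \gamma^{obs}]$. In the proof of Theorem 1 we showed that in such case if $\lambda_{\mathrm{FM1}} \ge \lambda_{\mathrm{FNM}}$ and $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} + \lambda_{\mathrm{FNM}}$ then $P(Z_j = i \mid \gamma^{obs}) > 1/2$, which in turn implies that $P(Z_{j'} = i \mid \gamma^{obs}) < 1/2$ for all $j' \ne j$, and so $\hat{Z}_{j'} \ne i$ for all $j' \ne j$.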
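The record-by-record minimization in Theorem 2 can be sketched as follows. The sketch assumes that the expected losses take the form implied by the loss in Equations (9) and (10), and that the posterior marginals come as a hypothetical nested list `post`, with `post[j][i]` approximating $P(Z_{j+1} = i+1 \mid \gamma^{obs})$ and the remaining mass of each row being the non-match probability. Ties are broken in favor of rejection, which is a choice made here rather than something stated in the text.

```python
def partial_estimate_argmin(post, lam_r, lam_fnm, lam_fm1, lam_fm2):
    """Minimize the posterior expected loss separately for each record."""
    cond1 = lam_fm2 >= lam_fm1 >= 2 * lam_r > 0
    cond2 = lam_fm1 >= lam_fnm > 0 and lam_fm2 >= lam_fm1 + lam_fnm
    if not (cond1 or cond2):
        raise ValueError("marginal minimization may produce conflicts")
    n1 = len(post[0])
    estimate = []
    for j, pj in enumerate(post, start=1):
        p_none = 1.0 - sum(pj)  # probability that record j has no match
        # Expected loss of each decision; "R" is inserted first so that
        # min() returns the rejection option when losses are tied.
        options = {"R": lam_r, n1 + j: lam_fnm * (1.0 - p_none)}
        for i, p in enumerate(pj, start=1):
            options[i] = lam_fm1 * p_none + lam_fm2 * (1.0 - p - p_none)
        estimate.append(min(options, key=options.get))
    return estimate

# Made-up marginals for n1 = 4 and n2 = 3 records.
post = [[0.97, 0.01, 0.00, 0.00],   # one clear candidate
        [0.00, 0.30, 0.30, 0.30],   # three equally likely candidates
        [0.00, 0.02, 0.03, 0.00]]   # almost surely unmatched

estimate = partial_estimate_argmin(post, lam_r=0.1, lam_fnm=1.0,
                                   lam_fm1=1.0, lam_fm2=2.0)
```

Here the first record is linked to record 1, the ambiguous second record is left unresolved, and the third is declared a non-match (label $n_1 + 3 = 7$).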

We now present a particular case of Theorem 2 that allows an explicit expression for the Bayes estimate.

Theorem 3.

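If $\lambda_{\mathrm{FM2}} \ge \lambda_{\mathrm{FM1}} \ge 2\lambda_{R} > 0$ and $\lambda_{\mathrm{FNM}} \ge 2\lambda_{R}$ in the loss function given by Equations (9) and (10), the Bayes estimate of the bipartite matching can be obtained from $\hat{\mathbf{Z}} = (\hat{Z}_1, \dots, \hat{Z}_{n_2})$, where $\hat{Z}_j$ is given by

$$\hat{Z}_j = \begin{cases} i, & \text{if } P(Z_j = i \mid \gamma^{obs}) > 1 - \dfrac{\lambda_{R}}{\lambda_{\mathrm{FM1}}} + \dfrac{\lambda_{\mathrm{FM2}}-\lambda_{\mathrm{FM1}}}{\lambda_{\mathrm{FM1}}}\, P(Z_j \notin \{i, n_1+j\} \mid \gamma^{obs}); & \text{(13a)}\\ n_1+j, & \text{if } P(Z_j = n_1+j \mid \gamma^{obs}) > 1 - \dfrac{\lambda_{R}}{\lambda_{\mathrm{FNM}}}; & \text{(13b)}\\ R, & \text{otherwise.} \end{cases}$$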
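A minimal sketch of this kind of explicit rule, applied to the toy scenario of Figure 2, is given below. It assumes that a link is declared when the best match probability exceeds $1 - \lambda_{R}/\lambda_{\mathrm{FM1}}$ plus a penalty proportional to the probability of a competing match, that a non-match is declared when the non-match probability exceeds $1 - \lambda_{R}/\lambda_{\mathrm{FNM}}$, and that the record is rejected otherwise; these thresholds follow from comparing the expected losses implied by Equations (9) and (10) against $\lambda_{R}$. The input layout, the function name, and the numbers are hypothetical.

```python
def partial_estimate_closed_form(post, lam_r, lam_fnm, lam_fm1, lam_fm2):
    """Explicit partial estimate with a rejection option (sketch)."""
    if not (lam_fm2 >= lam_fm1 >= 2 * lam_r > 0 and lam_fnm >= 2 * lam_r):
        raise ValueError("losses violate the conditions assumed here")
    n1 = len(post[0])
    estimate = []
    for j, pj in enumerate(post, start=1):
        p_none = 1.0 - sum(pj)                       # no-match probability
        best = max(range(n1), key=lambda i: pj[i])   # most probable candidate
        p_best = pj[best]
        p_other = 1.0 - p_best - p_none              # competing matches
        link_threshold = (1.0 - lam_r / lam_fm1
                          + (lam_fm2 - lam_fm1) / lam_fm1 * p_other)
        if p_best > link_threshold:
            estimate.append(best + 1)   # declare a link
        elif p_none > 1.0 - lam_r / lam_fnm:
            estimate.append(n1 + j)     # declare a non-match
        else:
            estimate.append("R")        # leave the record unresolved
    return estimate

# Record 1 has three equally likely candidates; record 2 has none.
post = [[0.3, 0.3, 0.3],
        [0.0, 0.0, 0.0]]

estimate = partial_estimate_closed_form(post, lam_r=0.1, lam_fnm=1.0,
                                        lam_fm1=1.0, lam_fm2=2.0)
```

The ambiguous record is rejected rather than forced into a non-match or an arbitrary link, while the record with no plausible candidate is declared a non-match (label $n_1 + 2 = 5$).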
Proof.

From Theorem <