Record Linkage to Match Customer Names: A Probabilistic Approach

06/26/2018 · by Bahare Fatemi, et al. · The University of British Columbia

Consider the following problem: given a database of records indexed by names (e.g., names of companies, restaurants, businesses, or universities) and a new name, determine whether the new name is in the database, and if so, which record it refers to. This problem is an instance of the record linkage problem and is challenging because people do not consistently use the official name, but use abbreviations, synonyms, different orders of terms, different spellings of terms, and short forms of terms, and the name can contain typos or spacing issues. We provide a probabilistic model using relational logistic regression to find the probability of each record in the database being the desired record for a given query, and find the best record(s) with respect to the probabilities. Building on term-matching and translational approaches for search, our model addresses many of the aforementioned challenges and provides good results when existing baselines fail. Using the probabilities outputted by the model, we can automate the search process for the portion of queries whose desired documents get a probability higher than a trust threshold. We evaluate our model on a large real-world dataset from a telecommunications company and compare it to several state-of-the-art baselines. The obtained results show that our model is a promising probabilistic model for record linkage for names. We also test whether the knowledge learned by our model on one domain can be effectively transferred to a new domain. For this purpose, we test our model on an unseen test set from the business names of the secondString dataset. Promising results show that our model can be effectively applied to unseen datasets. Finally, we study the sensitivity of our model to the statistics of datasets.


I Introduction

Many companies offer services that require searching their database for a text query specified by a user. A website containing reviews for restaurants lets a user find their desired restaurant by searching its name. A website containing scientific papers lets a user find their desired paper through searching its title. A telephone company needs to search through their customer records (individual names or company titles) for customer inquiries.

The challenge in designing a model for the purposes exemplified above arises when people abbreviate all or part of the name while the database contains the full name (e.g., searching for ICDM when the database contains international conference on data mining) or vice versa, change the order of the terms in the name (e.g., searching for relational probabilistic models when the database contains probabilistic relational models), enter only some (not all) terms in the name (e.g., searching for graphical models when the database contains probabilistic graphical models), shorten a long term or person’s name (e.g., searching for Bayes net when the database contains Bayesian networks or searching for Mike Brown when the database contains Michael Brown), add or remove spaces (e.g., searching for drop out when the database contains dropout), use different spellings of terms (e.g., searching for neighbor when the database contains neighbour), use a common misspelling of a term (e.g., searching for busyness when the database contains business), and have typos.

Record linkage [12], [6] refers to the problem of recognizing records in two separate files that represent identical persons or objects. It has been previously studied independently by researchers in several areas under various names including object identification [36], entity resolution [2], identity uncertainty [30], approximate matching [16], duplicate detection [27], merge/purge [18], or hardening soft information [7]. Applications of record linkage include citation matching [15], person identification in different Census datasets [37], and identifying different offers of the same product from multiple retailers for comparison shopping [3]. The problem we study in this paper, finding the corresponding name(s) in a database for a text query, is an instance of record linkage when records are names. Hereafter, we refer to this problem as record linkage for names.

In this paper, we study the problem of record linkage for names when labeled data in the form of a set of (query name, desired name) pairs is available. We develop a probabilistic model for this task, as a probabilistic approach facilitates the decision making process, e.g., for specifying an error tolerance and automating a portion of the queries. We experimented with several existing approaches and also developed a relational logistic regression (RLR) [22] model for this task which outperforms the existing approaches. We used RLR as it simplifies specifying and describing our model, may be extended when the dataset contains more fields, and empirically compares well to other related models [23]. The components used in our model can be computed offline with a linear pass over the dataset. The time complexity of answering queries (online phase) for the proposed method is sub-linear in the number of names in the database. Performing a search with our proposed model only takes a few seconds, and our model is to be used as a front-line service for the telecommunications company.

We tested our model empirically by conducting experiments on a large real-world dataset from a telecommunications company and compared our model with several state-of-the-art models. The obtained results show how our model outperforms the state-of-the-art. We also show how the probabilities outputted by our model facilitate decision making for query automation.

The knowledge learned through our model (a list of correlated terms) can be transferred to other domains. Transferring this knowledge is especially valuable for domains where labeled data does not exist, or for domains where the amount of labeled data (or the number of businesses in the database) is insufficient for machine learning purposes. To test the effectiveness of knowledge transfer for our model, we train a model on a dataset created from Yelp businesses and university names and test it on the secondString dataset [7]. The obtained results show that the knowledge our model learns on one domain can be effectively transferred to new domains.

For different domains, the aforementioned challenges for record linkage for names (e.g., abbreviation, order change, misspellings, etc.) may occur at different frequencies. For instance, in a domain containing university names, abbreviations may occur frequently, while in a domain containing paper titles they may be quite infrequent. We performed a sensitivity analysis on the dataset we created from Yelp businesses and university names to measure the effect of such statistical changes and analyze how our model is expected to perform, and how it compares to existing approaches, on new datasets with different statistics.

II Related Work

Record linkage for names is a similar problem to short-text search [24, 35], where users search through documents containing short texts (e.g., considering only document titles). In short-text search, however, a query is to be matched to a document with the same meaning. E.g., for a query containing the term passion, the engine may score two documents, one having the term passion and the other having love, (almost) equally. Such scoring is, however, not sensible for record linkage for names. Nevertheless, many techniques developed for short-text search can be used for record linkage for names.

A classic approach for a text search problem is exact term matching: matching exact terms in the query with those in records (or documents), weighting each term according to its importance. Well-known exact term matching algorithms are TFIDF [34] and Okapi BM25 [33]. An advantage of these approaches is that they are invariant to the change of order in the terms. However, due to not being able to handle abbreviations, short forms of names, typos, spacing issues, etc., these approaches fail on a portion of queries.

Another class of approaches that can be applied to this problem is approximate string matching [17]. Well-known distance functions include Levenstein [25], Monge-Elkan [27], and Jaro [21]. Each of these approaches uses a distance function that measures the dissimilarity between two strings. Distance functions often fail for record linkage for names when the surface forms of the query name and the desired name are very different (e.g., searching for ICDM when the database contains international conference on data mining). Furthermore, these approaches usually perform poorly when the order of terms in the query and the desired document differ.

Latent semantic models [8, 19, 4, 10, 31] aim at improving exact-term matching approaches by converting the query and the document to a smaller space, called the semantic space, and then measuring similarity in that space. A query and a record can have a high similarity score in the semantic space even if they have no terms in common. When labeled data is not available, these models use unsupervised methods, such as SVD, to carry out the conversion, in which case the conversion is only loosely coupled to the evaluation metric of the retrieval task. When labeled data is available, the conversion can be done in a supervised manner using, e.g., a deep neural network [20, 29]. The labeled training set may be constructed by having an expert manually select the appropriate record, or may come from a user clicking a record after searching for a query.

When supervised data is available, translational approaches [13, 14] learn a term to term translation between query and documents terms. The translations are learned using a labeled dataset. Studies show when large amounts of labeled data are available, translational models can be effective [13, 14].

The model we develop in this paper can be considered as a term matching algorithm, extended with ideas from translational models to address several issues that arise in record linkage for names including abbreviations, short forms of the names, common typos, and spacing issues.

III Background and Notations

To be consistent with other scientific papers in this field (e.g., [20] and [26]), we use the following terminology. A term is a sequence of letters and digits. A document is a sequence of terms. A corpus is a set of documents. A query is also a document, one whose duplicate in the corpus we are interested in finding. A positive labeled set is a set of (query, document) pairs where the document is the duplicate of the query in the corpus. A negative labeled set is a set of (query, document) pairs where the document is not the duplicate of the query.

In record linkage for names, a document corresponds to a record, except that a record may have several fields whereas a document contains only one field, its text. A corpus corresponds to a database of records. For two documents $D_1$ and $D_2$, $D_1 \cap D_2$ is the set of terms that are in both $D_1$ and $D_2$, and $D_1 - D_2$ is the set of terms that are in $D_1$ but not in $D_2$.

III-A TFIDF

TFIDF [34] is one of the most popular exact term matching algorithms. Beel et al. [1] reported that 83% of text-based recommender systems in the domain of digital libraries use TFIDF.

Given a query $Q$, the TFIDF score for a document $D$ being the desired document is:

$$score(Q, D) = \sum_{T \in Q \cap D} TF(T, D) \times IDF(T) \quad (1)$$

$TF(T, D)$ stands for term frequency and is computed by counting the number of times $T$ appears in $D$. $IDF(T)$ stands for inverse document frequency and measures how much information the term provides, that is, whether the term is common or rare across all documents in the corpus. IDF aims at scaling down the importance of common terms and scaling up the importance of rare terms. There are many variants, but typically the IDF score of a term $T$ for a corpus is computed as $IDF(T) = \log\frac{N}{n_T}$, where $N$ is the number of documents in the corpus and $n_T$ is the number of documents that contain the term $T$. Robertson [32] justified this score information-theoretically.

For record linkage for names, the TF part of TFIDF is usually 1, as each term (almost always) appears at most once in a document (e.g., we rarely see the name of a company, person, or paper containing the same term twice). Thus, we ignore the TF part and only use the IDF.
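As a concrete illustration (ours, not the authors' code), the following Python sketch computes IDF scores for a toy corpus of names using the standard log(N / n_T) form discussed above; rarer terms receive larger scores, and TF is ignored as in the model.

    import math
    from collections import Counter

    def compute_idf(corpus):
        """Compute IDF(T) = log(N / n_T) for every term T in a corpus of name strings."""
        n_docs = len(corpus)
        doc_freq = Counter()
        for doc in corpus:
            for term in set(doc.lower().split()):  # count each term at most once per document
                doc_freq[term] += 1
        return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    corpus = ["international conference on data mining",
              "pacific coffee company",
              "pacific data systems"]
    idf = compute_idf(corpus)
    # Rare terms (e.g., "coffee") get a higher IDF than common ones (e.g., "pacific" or "data").
    print(sorted(idf.items(), key=lambda kv: -kv[1]))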

III-B Relational logistic regression

Relational logistic regression (RLR) [22] is the relational counterpart of logistic regression and the directed counterpart of Markov logic [9]. We start with some terminology:

A population refers to a set of objects. The size of a population is a non-negative number indicating its cardinality. Logical variables (logvars) start with lower-case letters, and constants denoting objects start with upper-case letters. Associated with a logvar $x$ is a population $\Delta_x$. A lower-case letter written in bold refers to a set of logvars, and an upper-case letter written in bold refers to a set of objects.

A parametrized random variable (PRV) is of the form $F(t_1, \dots, t_k)$, where $F$ is a k-ary function symbol and each $t_i$ is a logvar or a constant.

A literal is a PRV or its negation. A formula is made up of literals connected with conjunction or disjunction. A weighted formula (WF) is a tuple $\langle \varphi, w \rangle$ where $\varphi$ is a formula and $w$ is a weight.

We write a substitution as $\theta = \{x_1/t_1, \dots, x_k/t_k\}$, where each $x_i$ is a different logvar and each $t_i$ is a logvar or a constant in $\Delta_{x_i}$. A grounding of a PRV can be obtained by a substitution mapping each of its logvars $x_i$ to an object in $\Delta_{x_i}$. Applying a substitution $\theta$ to a formula $\varphi$ (written as $\varphi\theta$) replaces each $x_i$ in $\varphi$ with $t_i$.

Let $Q(\mathbf{x})$ be a PRV whose probability depends on a set $\Psi$ of PRVs not including $Q$. We call $\Psi$ the parents of $Q$. Relational logistic regression (RLR) defines a conditional probability distribution for $Q(\mathbf{X})$ given an assignment $\Pi$ of truth values to every ground PRV of $\Psi$, using a set $\psi$ of WFs containing only PRVs from $\Psi$:

$$P(Q(\mathbf{X}) = True \mid \Pi) = \sigma\Big(\sum_{\langle \varphi, w \rangle \in \psi} w \cdot \eta(\varphi\{\mathbf{x}/\mathbf{X}\}, \Pi)\Big) \quad (2)$$

where $\eta(\varphi\{\mathbf{x}/\mathbf{X}\}, \Pi)$ is the number of instances of $\varphi\{\mathbf{x}/\mathbf{X}\}$ that are true w.r.t. $\Pi$, and $\sigma$ is the sigmoid function.

Example III.1

Consider that we want to find the probability of a person being happy, and we know that happiness is related to the number of kind friends the person has: the more kind friends the person has, the happier he/she is. The model in Fig. 1 shows this theory, with the WFs $\langle True, -4.5 \rangle$ and $\langle Friend(x, y) \wedge Kind(y), 1 \rangle$. In this model, let $\Pi$ be an assignment of values to $Friend(X, y)$ and $Kind(y)$. RLR sums over the WFs, resulting in:

$$P(Happy(X) = True \mid \Pi) = \sigma(-4.5 \cdot 1 + 1 \cdot n_X) \quad (3)$$

where $n_X$ represents the number of objects $Y$ for which $Friend(X, Y) \wedge Kind(Y)$ is true according to $\Pi$, corresponding to the number of friends of $X$ that are kind. When this count is greater than or equal to 5, the probability of $X$ being happy is closer to one than to zero; otherwise, the probability is closer to zero than to one. Therefore, the two WFs model "someone is happy if they have at least 5 friends that are kind". Note that -4.5 and 1 are the weights of the two formulae and are learned.

Example III.2

Suppose we want to have an RLR model, as in Fig. 2, to predict $Result(q, d)$: the probability of document $d$ being the result of searching for query $q$. Also, suppose we have a list of important terms. An RLR model may use the WFs $\langle True, w_0 \rangle$ and $\langle HasTerm(q, t) \wedge HasTerm(d, t) \wedge Important(t), w_1 \rangle$, where $HasTerm(d, t)$ is true if term $t$ is in document $d$, and $Important(t)$ is true if $t$ is important.

Let $\Pi$ be an assignment of truth values to all ground PRVs of $HasTerm$ and $Important$. RLR sums over the WFs, resulting in $\sigma(w_0 + w_1 \cdot n)$, where $n$ represents the number of terms $T$ for which $HasTerm(q, T) \wedge HasTerm(d, T) \wedge Important(T)$ is true according to $\Pi$, corresponding to the number of important terms that are in both $q$ and $d$. With weights for which the argument of the sigmoid becomes positive at a count of 3 (e.g., $w_0 = -2.5$ and $w_1 = 1$), when this count is greater than or equal to 3 the probability of $d$ being the result of $q$ is closer to one than to zero. Therefore, the model encodes "a document is a result of a query if they share at least 3 important terms".

Following [22, 11], we assume w.l.o.g. that formulae in WFs have no disjunction, indicate $True$ and $False$ with 1 and 0 respectively, replace conjunction with multiplication, and allow atoms with continuous values in WFs (e.g., $HasTerm(q, t) \wedge HasTerm(d, t)$ is replaced with $HasTerm(q, t) * HasTerm(d, t)$).

Fig. 1: An RLR model taken from Kazemi et al. [22].
Fig. 2: An RLR model for Result(q, d)
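To make the RLR mechanics concrete, here is a small Python sketch (ours, not from the paper) of Example III.1: the probability of Happy(X) is the sigmoid of a weighted sum of formula counts, with the weights -4.5 and 1 taken from the example.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def prob_happy(num_kind_friends, w_bias=-4.5, w_kind_friend=1.0):
        """RLR: sigmoid of the weighted formula counts.
        <True, -4.5> fires once; <Friend(x, y) & Kind(y), 1> fires once per kind friend."""
        return sigmoid(w_bias * 1 + w_kind_friend * num_kind_friends)

    for n in [0, 4, 5, 10]:
        print(n, round(prob_happy(n), 3))
    # With 5 or more kind friends the probability exceeds 0.5, matching the example.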

IV A Model of Record Linkage

Let $t$, $q$, and $d$ be logvars corresponding respectively to terms, queries, and documents. $HasTerm(d, t)$ is a Boolean PRV indicating whether document $d$ contains term $t$ or not (which is observed for all documents and terms), $IDF(t)$ is an observed real-valued PRV indicating the IDF score of term $t$, and $Result(q, d)$ is $True$ when $d$ is the desired document for $q$.

IV-A A probabilistic TFIDF-based model

We design an RLR model to assign to each document a probability of being the desired document for a query. We start with a basic RLR model and improve it step by step. The basic RLR model defines the conditional probability of $Result(q, d)$ using the following WFs: $\langle True, w_0 \rangle$, $\langle HasTerm(q, t) * HasTerm(d, t) * IDF(t), w_1 \rangle$. When both instances of $HasTerm$ are true, the formula contributes the weight $w_1 * IDF(t)$. RLR sums over the above WFs, resulting in:

$$P(Result(Q, D) = True) = \sigma\Big(w_0 + w_1 \sum_{T \in Q \cap D} IDF(T)\Big) \quad (4)$$

where $Q$ and $D$ are respectively a specific query and document. Having a positive and a negative labeled set, the weights $w_0$ and $w_1$ can be learned using gradient descent.

IV-B Normalizing the basic model

One issue with the basic model is that for a query containing only a few terms, or only common terms, $\sum_{T \in Q \cap D} IDF(T)$ may in general be very small. For such queries, unless $w_1$ is unrealistically high, the weighted sum would be small even for the desired document, causing the output probability of the model to be low. Furthermore, the score of a document $D$ for a query in the basic model depends on the number of query terms that are in $D$, not on the proportion of them.

To address this issue, we update the WFs to normalize the sum of the IDF scores by dividing it by the maximum IDF score a document can get for the given query. A document gets the maximum score for a query $Q$ if it has all the terms in $Q$; in such a case, the sum of the IDF scores is $\sum_{T \in Q} IDF(T)$. So we update the basic model to have the following WFs:

$$\langle True, w_0 \rangle \quad (5)$$

$$\Big\langle HasTerm(q, t) * HasTerm(d, t) * \frac{IDF(t)}{\sum_{t' \in q} IDF(t')}, w_1 \Big\rangle \quad (6)$$

where the denominator is the maximum IDF sum achievable for query $q$. Hereafter, we refer to the RLR model with the WFs (5) and (6) as the TFIDF model.
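A minimal sketch (ours, with illustrative rather than learned weights) of the TFIDF model of WFs (5) and (6): the probability is the sigmoid of a bias plus a weight times the IDF mass of the shared terms, normalized by the total IDF mass of the query.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def tfidf_prob(query_terms, doc_terms, idf, w0=-4.0, w1=8.0):
        """P(Result(Q, D)) = sigmoid(w0 + w1 * sum_{T in Q&D} IDF(T) / sum_{T in Q} IDF(T)).
        w0 and w1 are illustrative; in the paper they are learned from labeled pairs."""
        q, d = set(query_terms), set(doc_terms)
        max_score = sum(idf.get(t, 0.0) for t in q)
        if max_score == 0.0:
            return sigmoid(w0)
        shared = sum(idf.get(t, 0.0) for t in q & d)
        return sigmoid(w0 + w1 * shared / max_score)

    idf = {"pacific": 0.3, "coffee": 1.1, "company": 0.7}
    print(round(tfidf_prob(["pacific", "coffee"], ["pacific", "coffee", "company"], idf), 3))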

Example IV.1

Consider a query $Q$ consisting of the terms $T_1$, $T_2$, $T_3$, and $T_4$. The document that can get the maximum score for $Q$ has all of the terms $T_1$, $T_2$, $T_3$, and $T_4$. The maximum sum of IDF scores achievable for this query is:

$$IDF(T_1) + IDF(T_2) + IDF(T_3) + IDF(T_4)$$

and the normalized score of any document $D$ for $Q$ is:

$$\frac{\sum_{T \in Q \cap D} IDF(T)}{IDF(T_1) + IDF(T_2) + IDF(T_3) + IDF(T_4)}$$

Example IV.2

Consider a query $Q$ and a document $D$. Then:

$$P(Result(Q, D) = True) = \sigma\Big(w_0 + w_1 \frac{\sum_{T \in Q \cap D} IDF(T)}{\sum_{T \in Q} IDF(T)}\Big) \quad (7)$$

in which the numerator of the fraction is the sum of the scores that $D$ gets for the terms of query $Q$, and the denominator is the maximum score a document can get for query $Q$ and normalizes the numerator.

IV-C Adding translations

As mentioned, a TFIDF-based method may fail on documents that use different terms than those in the query. Let the PRV $TR(t, t')$, where $t$ and $t'$ are two logvars with populations of terms, represent the proposition that term $t'$ is a translation (or a part of a translation) of term $t$. We will explain how such a PRV can be learned using positive and negative labeled sets in later sections. We assume the translation relation is symmetric (if $t'$ is a translation or a part of a translation of $t$, then $t$ is also a translation or a part of a translation of $t'$). The following WF can use this PRV:

$$\Big\langle HasTerm(q, t) * \neg HasTerm(d, t) * HasTerm(d, t') * \neg HasTerm(q, t') * TR(t, t') * \frac{IDF(t)}{\sum_{t'' \in q} IDF(t'')}, w_1 \Big\rangle \quad (8)$$

This WF considers pairs of terms $T$ and $T'$ such that $T \in Q$, $T \notin D$, $T' \in D$, $T' \notin Q$, and $T'$ is a part of a translation of $T$ with probability $p$, and gives score $p \cdot IDF(T)$ (normalized as before) to the document. This WF complements the WF in (6). If a term $T$ in $Q$ is also in $D$, then the WF in (6) gives score $IDF(T)$ to $D$. If $T$ is not in $D$ but there exists a term $T'$ in $D$ (which is not in $Q$) that can be a translation of $T$ with probability $p$, then the new WF gives score $p \cdot IDF(T)$ to $D$. That is because even though $D$ does not have the exact term $T$, with probability $p$ it has a term that corresponds to $T$. Since this WF complements the WF in (6), we use the same weight $w_1$.

Example IV.3

In Example IV.2, with the same query and document, suppose $TR(T, T') = p$ for a term $T \in Q - D$ and a term $T' \in D - Q$. Then the numerator of the fraction in Eq. 7 will be summed with $p \cdot IDF(T)$. Note that if $T$ were also in $D$, we would not add $p \cdot IDF(T)$ to the score, as the document contains $T$ as well and we have already given score $IDF(T)$ to the document.

IV-D 1-to-many and many-to-1 translations

For some $T$ that is in $Q$ but not in $D$ and some $T'$ that is in $D$ but not in $Q$ such that $TR(T, T') > 0$, the translation of $T$ may contain multiple terms, and $T'$ may be only one term in the translation of $T$. As an example, if $T$ is ICDM and $D$ is international conference on DM with $TR(ICDM, international) > 0$, international is only one term of the translation for ICDM; the other terms are conference, on, and DM.

The IDF formulation makes the strong assumption that each term appears in a document independently of the other terms [32]. Therefore, if $Q$ is ICDM and $D$ is international conference on DM, our current WFs will give scores to $D$ for all its terms independently. This is, however, not intuitive, as the terms in $D$ in such cases are highly dependent: a document containing the terms international, conference and on is much more likely to also contain the term DM than a random document.

To address this issue, when we learn the values of the $TR$ PRV, we also compute for each term $T$ the quantity $maxdup(T) = \max_{D \in C} \sum_{T' \in D} \mathbb{1}(TR(T, T') > 0)$, where $\mathbb{1}(\cdot)$ is 1 if its argument holds and 0 otherwise, and $C$ is the set of all documents in the corpus. This number corresponds to the maximum number of terms in a single document for which $TR(T, T') > 0$. Assuming $MaxDup(t)$ is a PRV whose value for each term $T$ is $maxdup(T)$, we update the WF in (8) as:

$$\Big\langle HasTerm(q, t) * \neg HasTerm(d, t) * HasTerm(d, t') * \neg HasTerm(q, t') * TR(t, t') * \frac{IDF(t)}{MaxDup(t) \sum_{t'' \in q} IDF(t'')}, w_1 \Big\rangle \quad (9)$$

We refer to the RLR model with WFs (5), (6), and (9) as TFIDF+TR.

Example IV.4

We expect that $maxdup(ICDM) = 5$ (for international, conference, on, data, and mining) and $maxdup(center) = 1$ (for, e.g., centre).

Example IV.5

In Example IV.2, with the same query and document, suppose we have $TR(T, T'_1) = p_1$, $TR(T, T'_2) = p_2$, and $maxdup(T) = m$ for a query term $T$ that is not in $D$ and document terms $T'_1, T'_2$ that are not in $Q$. Then the numerator of the fraction in Eq. 7 will be summed with $\frac{p_1 \cdot IDF(T) + p_2 \cdot IDF(T)}{m}$. The denominator is the same as before, because it should represent the maximum score a document can get for this query for normalization and remains unchanged by this extra information about translation pairs.
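The exact form of WF (9) is partly reconstructed above; under our reading, a query term missing from the document contributes TR(T, T') * IDF(T) / maxdup(T) for each document term T' that partially translates it. The Python sketch below (ours, not the paper's code; tr and maxdup are assumed precomputed dictionaries) computes this extra mass, which would be added to the numerator of the fraction in Eq. 7.

    def tr_contribution(query_terms, doc_terms, idf, tr, maxdup):
        """Translation mass added to the numerator of Eq. 7 (our reading of WF (9)).
        tr maps (query_term, doc_term) -> probability; maxdup maps term -> int >= 1."""
        q, d = set(query_terms), set(doc_terms)
        extra = 0.0
        for t in q - d:                    # query terms the document does not contain
            for t2 in d - q:               # document terms the query does not contain
                p = tr.get((t, t2), 0.0)
                if p > 0.0:
                    extra += p * idf.get(t, 0.0) / max(maxdup.get(t, 1), 1)
        return extra

This would plug into the numerator of the fraction inside the sigmoid of the earlier tfidf_prob sketch.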

IV-E Learning

We learn and populate $TR(T, T')$ for each pair of terms using a regularized proportion, which we threshold to obtain a sparse representation. To define our regularized proportion, we first provide some intuition.

Suppose $\langle Q, D \rangle$ is a pair in the positive labeled set. For two terms $T$ and $T'$, if $T \in Q - D$ and $T' \in D - Q$ (or $T \in D - Q$ and $T' \in Q - D$), then $T'$ might be a term in the translation of $T$. Such occurrences are positive events regarding $TR(T, T')$.

Now consider the case where $T \in Q \cap D$ and $T' \in D - Q$ (or $T' \in Q - D$). This case reduces the possibility of $T'$ being a term in the translation of $T$, as $T$ itself also appears in $D$. Therefore, such occurrences are negative events regarding $TR(T, T')$.

A natural way to compute $TR(T, T')$ is to divide the number of positive events by the sum of the numbers of positive and negative events. Let $Match(T, T')$ be the number of positive events and $Seen(T, T')$ be the total number of positive and negative events regarding $TR(T, T')$. Then we let $TR(T, T') = \frac{Match(T, T') + \alpha}{Seen(T, T') + \alpha + \beta}$, where $\alpha$ and $\beta$ are pseudo-counts and are learned by cross-validation. The pseudo-counts impose a prior that a pair of terms is unlikely to be part of the translation of each other, unless we see them match multiple times. The pseudo-counts allow small amounts of data to have some influence, but not much, whereas large amounts of data can overwhelm the prior.
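A sketch (ours) of the regularized proportion: Match counts positive events, Seen counts positive plus negative events, and the pseudo-counts act as a prior toward "not a translation". The exact placement of the pseudo-counts and the default values of alpha and beta are our assumptions; the 0.7 threshold mirrors the value reported later in Section V.

    from collections import Counter

    def learn_tr(positive_pairs, alpha=1.0, beta=10.0, threshold=0.7):
        """positive_pairs: iterable of (query_terms, doc_terms) from the positive labeled set.
        Returns {(t, t2): probability} for pairs whose regularized proportion passes the threshold.
        For brevity only the (query-side term, doc-side term) direction is stored; the paper
        assumes the translation relation is symmetric."""
        match, seen = Counter(), Counter()
        for q, d in positive_pairs:
            q, d = set(q), set(d)
            for t in q - d:               # positive events: t only in the query, t2 only in the doc
                for t2 in d - q:
                    match[(t, t2)] += 1
                    seen[(t, t2)] += 1
            for t in q & d:               # negative events: t appears on both sides, t2 only in the doc
                for t2 in d - q:
                    seen[(t, t2)] += 1
        tr = {}
        for pair, m in match.items():
            p = (m + alpha) / (seen[pair] + alpha + beta)
            if p >= threshold:
                tr[pair] = p
        return tr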

IV-F Adding bigrams

Spacing is an important challenge in record linkage for names: the query may contain a space between two terms where the document does not (e.g., searching for drop out where the document contains dropout) or vice versa. In order to handle such cases, we use the bigrams of the query and the documents where a bigram is a concatenation of two consecutive terms in the query or documents.

There are two cases that need to be considered:

1) Query contains the bigram: Consider a document $D$ containing two consecutive terms $T_1$ and $T_2$. If a query contains the term $T_1T_2$ (the concatenation of $T_1$ and $T_2$), then $T_1$ and $T_2$ are parts of the translation of $T_1T_2$. As an off-line process done once, before the queries arrive, for each document $D$ and each pair of consecutive terms $T_1, T_2$ in $D$, we set $TR(T_1T_2, T_1) = TR(T_1T_2, T_2) = 1$ if $T_1T_2$ appears as a term in at least one document in the dataset (we observed that by only considering the bigrams that appear in at least one document, the number of bigrams stored in the $TR$ matrix decreases substantially while the accuracy is not affected much). This allows us to recognize the two elements of the bigram in the document that appear in the query. Note that we do not add $TR(T_1, T_1T_2)$ and $TR(T_2, T_1T_2)$ to the matrix (which could be helpful when a document contains the bigram), as adding them would cause each term to have many potential translations in the matrix, thus slowing down the search. Instead, we handle the case where the document contains the bigram with a different approach, as described below.

Example IV.6

Suppose the corpus contains a document drop out. As explained, we set $TR(dropout, drop) = 1$ and $TR(dropout, out) = 1$. So if we search for a query Q = dropout, the translation pairs help us find the components of the bigram dropout, which are drop and out.

2) Document contains the bigram: Suppose there is a document $D$, and $T_1$ and $T_2$ are two consecutive terms in the query $Q$, neither of which appears in the document. Then we should also look for the concatenation of these two terms, $T_1T_2$, in the document. In order for our WFs to remain unaffected, if $D$ contains neither $T_1$ nor $T_2$ but it contains $T_1T_2$, we assume $D$ also has $T_1$ and $T_2$.

We refer to the TFIDF+TR model after adding the bigrams as TFIDF+TR+BG.

Example IV.7

Suppose we want to search for $Q$ = drop out. The document $D$ = dropout will get the score of a document that has both terms drop and out, for having the bigram dropout.
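A small sketch (ours) of the two bigram cases: offline, consecutive document terms are mapped back from their concatenation in the TR map (only when the concatenation occurs as a term somewhere, per the filter noted above); at query time, two consecutive query terms that are both missing from a document are also tried as a concatenated term. Function and variable names are ours.

    def add_document_bigrams(tr, corpus_terms, documents):
        """Offline: for consecutive terms t1, t2 in a document, map the concatenated bigram
        back to its parts, but only if the bigram itself occurs as a term in the corpus."""
        for doc in documents:
            terms = doc.split()
            for t1, t2 in zip(terms, terms[1:]):
                bigram = t1 + t2
                if bigram in corpus_terms:
                    tr[(bigram, t1)] = 1.0
                    tr[(bigram, t2)] = 1.0

    def expand_query_bigrams(query_terms, doc_terms):
        """Query time: if neither of two consecutive query terms is in the document but their
        concatenation is, treat the document as if it also contained the two separate terms."""
        doc_terms = set(doc_terms)
        for t1, t2 in zip(query_terms, query_terms[1:]):
            if t1 not in doc_terms and t2 not in doc_terms and (t1 + t2) in doc_terms:
                doc_terms.update([t1, t2])
        return doc_terms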

Fig. 3: Set differences for the hit@k of TFIDF (C1), TFIDF+TR (C2), and TFIDF+TR+BG (C3).

IV-G Implementation

In order to implement our model efficiently, we use the following data structures. For the WF in (6), given a query $Q$, we need to score documents that have at least one term in common with $Q$. We use an inverted index, which is a hash map from terms to the documents that contain them. Then for each term $T \in Q$, we can access the documents having $T$ and update their scores. For the translations, the $TR$ matrix can be very large and so needs to be implemented efficiently. We store a hash map from terms to the set of terms that may be in their translation, together with the corresponding probabilities. Then for each term $T \in Q$, we retrieve all documents not containing $T$ but containing a term $T'$ with $TR(T, T') > 0$, and update their scores according to the probabilities. As explained, when we construct a bigram $T_1T_2$ in our offline process, we add the key $T_1T_2$ (with values for $T_1$ and $T_2$) to our hash map only if the term $T_1T_2$ exists in at least one document. This reduces the size of the hash map and the running time substantially. These models are all developed for a front-line application and produce results in a few seconds for a large dataset.
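A sketch (ours) of the data structures described above: an inverted index from terms to document ids, plus a per-term view of the translation map (tr_by_term, our reshaping of the TR hash map), so that only documents sharing a query term or a translated term are ever scored.

    from collections import defaultdict

    def build_inverted_index(documents):
        """Map each term to the set of ids of documents containing it."""
        index = defaultdict(set)
        for doc_id, doc in enumerate(documents):
            for term in set(doc.split()):
                index[term].add(doc_id)
        return index

    def candidate_documents(query_terms, index, tr_by_term):
        """Documents sharing a query term, plus documents containing a likely translation.
        tr_by_term maps a term to {translated_term: probability}."""
        candidates = set()
        for t in query_terms:
            candidates |= index.get(t, set())
            for t2 in tr_by_term.get(t, {}):
                candidates |= index.get(t2, set())
        return candidates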

V Empirical Results

In our experiments, we aim at answering four questions.

  • Q1: How does TFIDF+TR+BG compare to existing approaches for record linkage for names, and how helpful are TR and BG?

  • Q2: Can we effectively use the probabilities outputted by our models for query automation?

  • Q3: Is it possible to learn translations on one dataset and use them for a second dataset (i.e., transfer learning)?

  • Q4: How do TFIDF+TR+BG and other approaches perform when the statistics of the dataset change?

We design four experiments, one to answer each of these questions.

V-A Q1: How well does TFIDF+TR+BG perform?

To answer Q1, we compare TFIDF+TR+BG’s performance on a private dataset with several well-known benchmarks in the literature.

Dataset: Our dataset contains a list of business names of the customers of a telecommunications company. The company offers a service which requires searching a name entered by a customer in the dataset. In this application, the list of customer names is the corpus, each stored customer name is a document, and any searched customer name is a query.

The telecommunications company dataset contains approximately 650K customer names. Each month, around 4K different customer names (queries) are searched. Currently, except for some obvious cases, the final document for a query is found or endorsed by an expert, providing a large positive labeled set with approximately 1600K pairs. We used the pairs in our positive labeled set corresponding to queries up to a certain point in time as training data, and the queries from the following two months (almost 8K queries) as test data. This makes the task an extrapolation task (predicting the future), which should be more challenging than interpolation. We also created a negative labeled set for learning the weights of the model by pairing each query with 5 documents other than the desired document, following Huang et al. [20]. Note that the output probabilities of the model will change if we use a number other than 5, but the ranking will not.

Learning TR: In order to learn the translation PRV $TR$, we selected the pseudo-counts $\alpha$ and $\beta$ on a validation set. To sparsify this PRV and make its size manageable, we only keep the pairs of terms whose probability is at least 0.7, where 0.7 is also selected on a validation set. This provided us with approximately 10K pairs of terms. Table I shows some pairs of terms learned by our regularized proportion algorithm. The translations we learn through this procedure can be transferred and used for other similar tasks.

Association Service Centre Consulting
Assoc Svc Center Consultin
Assn Srvc Centr Consul
Associate Srvs Cntr Consltng
Asso Srv Cntre conslt
TABLE I: Some pairs of terms learned by our regularized proportion for $TR$: any two terms in the same column form a learned pair.

We also tried the heuristic proposed in [13], which in our formulation estimates $TR(T, T')$ as the number of pairs in the positive labeled set in which $T$ appears in the query and $T'$ appears in the document, divided by the number of queries in the positive labeled set that contain $T$. Accepting only the pairs passing a probability threshold, this heuristic produced approximately 500K pairs of terms, which severely slowed down the search engine. Furthermore, we found that the pairs generated using this heuristic are not suited for our task, mainly because its numerator counts co-occurrences even when $T$ itself appears in the document. While this may be sensible for short-text search, the following example shows why it may not be suitable for record linkage for names.

Example V.1

Let a query $Q$ containing a term $T$, and the corresponding document $D$, containing $T$ together with two other terms $T'_1$ and $T'_2$, be a pair in the positive labeled set. According to the heuristic proposed in [13], this pair supports $TR(T, T'_1)$ and $TR(T, T'_2)$, i.e., $T'_1$ and $T'_2$ being parts of the translation of $T$. However, since $T$ also appears in $D$, not only should this pair not support $TR(T, T'_1)$ and $TR(T, T'_2)$, but it should reduce these probabilities. That is because, if $T'_1$ and $T'_2$ were parts of the translation of $T$, then $T$ would not have appeared in $D$.

Fig. 4: hit@1 of the three methods TFIDF, TFIDF+TR, and TFIDF+TR+BG for different trust thresholds tt.
Fig. 5: automation percentage given the trust threshold tt for the three methods TFIDF, TFIDF+TR, and TFIDF+TR+BG.

Baselines: We compare our models to several baselines. Exact match refers to matching the query to a document with the exact same name. Shared-terms scores documents based on the number of terms they have in common with the query. Levenstein and Jaro-Winkler are two well-known distance functions widely used for approximate string matching. To speed these distance functions up, we only score documents that have at least one term in common with the query. Huang et al. [20] propose several deep learning architectures for search; L-WH DNN is the architecture that outperforms the other deep architectures as well as other latent semantic models in their experiments.

We learn the weights of all these models using the positive and negative labeled sets. We break ties by picking the document that has fewer terms.

For each query in our test set, we score the documents using these methods and pick the top $k$. Following [5, 28], for a specific value of $k$, we define hit@k as the percentage of test queries whose desired document appears in the top $k$ retrieved documents.

Results: Table II shows the hit@k for several values of $k$ for our models and the baseline methods on the telecommunications company dataset.

The results in Table II show that 57.83% of the test set are exact matches. Distance functions do not work well in our application since, in many cases, the string of the correct match is very different from the string of the searched query (e.g., as in ICDM vs. international conference on DM). Another reason why methods based on distance functions do not work well is that they do not consider the importance of terms. As an example, consider searching for the query ICDM association, where the database contains two documents: 1) ICDM and 2) NIPS association. In this case, the second document has a lower edit distance to the query than the first, while a document having ICDM is a better match than one having association for this query. Ignoring the importance of the terms misleads these algorithms towards selecting the second document.

According to the results, deep neural network models also do not work well in our application. We found the reason to be that many terms in the customer and company names are unique and only appear a few times; therefore, there is not enough data for these models to learn appropriate weights for these terms. Note that for record linkage for names, a term appearing fewer times is usually more important and carries more weight. The state-of-the-art deep learning models for information retrieval rely on learning embeddings for terms [26]. With only a few occurrences of many terms, learning appropriate embeddings for terms is difficult.

hit@1 hit@5 hit@10 hit@100
exact-match 57.83 57.83 57.83 57.83
shared-terms 88.76 92.32 93.19 95.53
Levenstein 84.16 87.23 88.16 91.29
Jaro-Winkler 88.35 91.76 92.68 95.31
L-WH DNN 75.90 80.11 81.93 88.74
TFIDF 91.33 95.24 95.95 97.32
TFIDF+TR 91.58 95.42 96.13 97.64
TFIDF+TR+BG 92.03 95.63 96.31 97.88
TABLE II: hit@k for 4 values of $k$ on the telecommunications company dataset. The winner is in bold.

Table II also demonstrates that both translations and bigrams have a positive effect on the performance of our model in terms of hit@k.

In order to better demonstrate the effect of translations and bigrams, for a given value of $k$, let $C_1$, $C_2$, and $C_3$ be the sets of test queries for which the desired document is in the top $k$ retrieved documents when using TFIDF, TFIDF+TR, and TFIDF+TR+BG respectively. As $k$ grows, these sets either do not change or grow in size. However, the differences between them may vary for different values of $k$.

Fig. 3 represents the set differences between $C_1$, $C_2$, and $C_3$. It can be seen that for all values of $k$, $C_2$ is always bigger than $C_1$, and similarly $C_3$ is bigger than $C_2$, indicating that adding translations and bigrams always helps improve the hit@k. Note that as $k$ becomes larger, the sets become larger and the set differences $C_1 - C_2$ and $C_2 - C_3$ become quite small (5-10 queries out of the 8000 test queries). This shows that our proposed methods cover almost everything TFIDF covers. That is because, for queries whose terms differ from those of the actual document, TFIDF cannot find the desired document even for a very large $k$, whereas the other two methods may be able to do so.

The results in Table II and Fig. 3 answer our Q1. They indicate that TFIDF+TR+BG performs well empirically, outperforming state-of-the-art approaches. They also show that translations and bigrams both have a positive effect on the accuracy of the model.

V-B Q2: Automating searching for a query

Given that our model outputs probabilities for documents being the desired document for a query, we can associate it with a utility function. One possible utility function to combine with this model is to set a trust threshold $tt$ such that if the top output of the model has a probability higher than $tt$, it is considered the desired document without having an expert examine it, i.e., the process is automated for a percentage of the queries. There are two important criteria for picking $tt$ (a small sketch of computing both follows this list):

  • hit@1: if we trust the top outputs of a model that pass the threshold, what percentage of them will be the actual desired documents.

  • automation percentage: if we trust the top outputs of a model that pass $tt$, what percentage of queries will be automatically matched to a document, without expert verification.
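Both criteria can be computed directly from scored test queries. The sketch below is ours and assumes each test query is summarized as (top probability, top document id, desired document id).

    def automation_metrics(results, tt):
        """results: list of (top_probability, top_doc_id, desired_doc_id) for test queries.
        Returns (hit@1 among automated queries, automation percentage)."""
        automated = [(p, top, gold) for p, top, gold in results if p >= tt]
        if not automated:
            return 0.0, 0.0
        correct = sum(1 for _, top, gold in automated if top == gold)
        hit_at_1 = 100.0 * correct / len(automated)
        automation_pct = 100.0 * len(automated) / len(results)
        return hit_at_1, automation_pct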

Fig. 4 shows the hit@1 of our three models for different values of tt. It can be seen that the charts for the three models overlap: they all have similar performance and high hit@1 when tt is set high enough. Fig. 5 shows the automation percentage vs. tt. It shows that both translations and bigrams increase the automation percentage for every value of tt that we tried.

Fig. 4 and Fig. 5 provide the answer to Q2. They show that the probabilities outputted by our models can be effectively used for query automation. They also show that both translations and bigrams improve the automation percentage, thus providing more evidence for Q1.

V-C Q3: Transfer learning

While the telecommunications company dataset is large and contains many positive pairs, in many similar applications (e.g., for smaller companies) such a dataset may not be available. In such cases, it is possible to learn the translations and bigrams over other datasets and use them for the dataset at hand (i.e. transfer learning). That is because the translations and bigrams are, for the most part, domain-independent.

In order to empirically test the transferability of translations and bigrams and answer Q3, we conduct an experiment in which we find the translations and bigrams on one dataset and then use TFIDF+TR+BG in a new domain containing unseen business names.

Dataset: We collected a set of 130K business names from Yelp and 500 university names and their abbreviations from Wikipedia. We also collected from the web a set of equivalent names, terms with more than one spelling, and common misspellings (this set is only used for generating a labeled set; we do not use it during testing). Using the collected data, we generated a positive labeled dataset to train our model on, with the following procedure. For each name in our dataset, we generate a number of positive pairs (i.e., duplicates), where the number of duplicates per name is drawn from a normal distribution. Each duplicate is generated as follows. With some probability, a change is applied to the duplicate; otherwise, the duplicate is the same as the original name. If a change is to be made, then with some probability the whole name is abbreviated. If the abbreviation is not applied, then with some probability one term is removed from the duplicate name; the term to be removed is selected randomly with probability proportional to the frequency of the term (more frequent terms are more likely to be removed). With some probability, one of the terms in the duplicate name is replaced with an equivalent form of it (e.g., center may be replaced with centre). With some probability, one of the spaces in the duplicate name is selected randomly and removed. With some probability, two terms are selected randomly from the duplicate name and swapped. Finally, with some probability, a random typo is introduced into one of the terms of the duplicate name. We selected these probabilities to be similar to the statistics of the telecommunications company dataset, to better reflect the real world. Our validation set is a dataset of 9K pairs generated with this process. The test set is a completely unseen dataset from a different source than the training data: a set of 600 business names from the secondString dataset [7].
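To make the generation procedure concrete, here is a simplified Python sketch (ours): the probability values are placeholders rather than the ones used in the paper, term removal is uniform instead of frequency-proportional, and the equivalents dictionary stands in for the collected list of equivalent spellings and misspellings.

    import random

    def make_duplicate(name, equivalents, p_change=0.7, p_abbr=0.2, p_remove=0.2,
                       p_equiv=0.3, p_space=0.2, p_swap=0.2, p_typo=0.1):
        """Simplified duplicate generator; all probabilities are illustrative placeholders."""
        terms = name.split()
        if random.random() > p_change:
            return name                                     # duplicate identical to the original
        if random.random() < p_abbr:
            return "".join(t[0] for t in terms)             # abbreviate the whole name
        if random.random() < p_remove and len(terms) > 1:
            terms.pop(random.randrange(len(terms)))         # drop one term (uniformly here)
        if random.random() < p_equiv and terms:
            i = random.randrange(len(terms))
            terms[i] = equivalents.get(terms[i], terms[i])  # equivalent spelling / misspelling
        if random.random() < p_swap and len(terms) > 1:
            i, j = random.sample(range(len(terms)), 2)
            terms[i], terms[j] = terms[j], terms[i]         # swap two terms
        dup = " ".join(terms)
        if random.random() < p_space and " " in dup:
            k = random.choice([m for m, c in enumerate(dup) if c == " "])
            dup = dup[:k] + dup[k + 1:]                     # remove one space
        if random.random() < p_typo and dup:
            k = random.randrange(len(dup))
            dup = dup[:k] + random.choice("abcdefghijklmnopqrstuvwxyz") + dup[k + 1:]  # typo
        return dup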

Results: We learn translations and bigrams over the collected dataset similarly to the previous section. Then we test all baselines on the secondString dataset. The results are available in Table III.

Fig. 6: hit@1 for TFIDF and TFIDF+TR+BG as each of the probabilities used to generate duplicates (Section V-C) is varied in turn, panels (a)-(f).
hit@1 hit@5 hit@10 hit@100
exact-match 23.63 23.63 23.63 23.63
shared-terms 81.98 90.08 93.38 96.36
Levenstein 80.66 88.92 93.71 96.03
Jaro-Winkler 87.76 93.05 94.21 95.86
L-WH DNN 38.01 44.62 47.60 68.59
TFIDF 88.42 93.38 94.04 96.36
TFIDF+TR 94.21 94.71 95.04 97.19
TFIDF+TR+BG 95.54 94.87 95.20 97.35
TABLE III: hit@k for 4 values of $k$ on the business names of the secondString dataset. The winner is in bold.

According to the results, the translations and bigrams learned on one dataset can be helpful for new datasets, as TFIDF+TR+BG performs better than TFIDF. This experiment answers Q3 affirmatively. It also provides more evidence for Q1, as TFIDF+TR+BG performs better than the other baselines.

V-D Q4: Sensitivity to dataset statistics

In order to answer Q4, we conduct experiments to measure the sensitivity of TFIDF and TFIDF+TR+BG to dataset statistics. For this purpose, we generated datasets from the Yelp businesses and university names as described earlier, each time varying one of the generation probabilities. We randomly split the generated data into training and test sets, train our models on the training set, and test their performance on the test set. Each of the following charts shows the hit@1 of TFIDF and TFIDF+TR+BG when all the parameters are fixed except one:

  • Fig. 6(a) shows the hit@1 of TFIDF and TFIDF+TR+BG when the probability of applying a change to a duplicate varies. As this probability increases, the dataset becomes more and more challenging and the hit@1 of both methods decreases. However, TFIDF is affected more severely than TFIDF+TR+BG, and the gap between the two methods gradually increases. This shows that the translations and bigrams help our model learn about the variations of the duplicate names and make more accurate predictions.

  • In Fig. 6(b), the hit@1 of TFIDF and TFIDF+TR+BG is plotted when the probability of removing a term varies. The chart shows that increasing this probability makes the hit@1 of both methods decrease almost equally. That is because when more terms are removed from the query name, finding the desired document becomes harder for both methods.

  • In Fig. 6(c), Fig. 6(d), and Fig. 6(e), the hit@1 of TFIDF and TFIDF+TR+BG is plotted when the probabilities of replacing a term with an equivalent form, abbreviating the name, and removing a space vary, respectively. The charts show that as these probabilities increase, the hit@1 of TFIDF is more severely affected than that of TFIDF+TR+BG. That is again because of the positive role the translations and the bigrams play in learning about equivalences, abbreviations, and spacing issues.

  • In Fig. 6(f), the hit@1 of TFIDF and TFIDF+TR+BG is plotted when the probability of introducing a typo varies. The chart shows that neither TFIDF nor TFIDF+TR+BG is able to handle random typos.

According to the charts, TFIDF+TR+BG is expected to be an effective model for datasets with more variations in the duplicate names in terms of equivalent names, common misspellings, abbreviations, and spacing issues. This shows that our model is robust across several types of variations in the dataset. However, TFIDF+TR+BG is not expected to work well on datasets where typos occur very frequently. Extending this model to better address typos is an interesting direction for future research.

VI Conclusion

In this paper, we studied an instance of the record linkage problem for names. We developed a probabilistic model using relational logistic regression. We started with a probabilistic TFIDF-based model, then added the ability to recognize two terms that are not identical but may be parts of the translation of each other, and also addressed spacing issues. We tested our models on a large dataset from a telecommunications company and compared them with several baselines. The obtained results indicated that our model outperforms existing state-of-the-art baselines. We showed that the knowledge learned by our model can be transferred to new domains. We also analyzed the sensitivity of our model to variations in the dataset and showed that it is robust across several types of variation.

References

  • [1] J. Beel, B. Gipp, S. Langer, and C. Breitinger, “Research-paper recommender systems: a literature survey,” International Journal on Digital Libraries, vol. 17, no. 4, pp. 305–338, 2016.
  • [2] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom, “Swoosh: a generic approach to entity resolution,” The VLDB Journal—The International Journal on Very Large Data Bases, vol. 18, no. 1, pp. 255–276, 2009.
  • [3] M. Bilenko, S. Basu, and M. Sahami, “Adaptive product normalization: Using online learning for record linkage in comparison shopping,” in Data Mining, Fifth IEEE International Conference on.   IEEE, 2005, pp. 8–pp.
  • [4] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
  • [5] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Advances in neural information processing systems, 2013, pp. 2787–2795.
  • [6] P. Christen, Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection.   Springer Science & Business Media, 2012.
  • [7] W. W. Cohen, H. Kautz, and D. McAllester, “Hardening soft information sources,” in Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2000, pp. 255–259.
  • [8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American society for information science, vol. 41, no. 6, p. 391, 1990.
  • [9] P. Domingos and D. Lowd, “Markov logic: An interface layer for artificial intelligence,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 3, no. 1, pp. 1–155, 2009.
  • [10] S. Dumais, T. K. Landauer, and M. L. Littman, “Automatic cross-linguistic information retrieval using latent semantic indexing,” 1997.
  • [11] B. Fatemi, S. M. Kazemi, and D. Poole, “A learning algorithm for relational logistic regression: Preliminary results,” arXiv preprint arXiv:1606.08531, 2016.
  • [12] I. P. Fellegi and A. B. Sunter, “A theory for record linkage,” Journal of the American Statistical Association, vol. 64, no. 328, pp. 1183–1210, 1969.
  • [13] J. Gao, X. He, and J.-Y. Nie, “Clickthrough-based translation models for web search: from word models to phrase models,” in Proceedings of the 19th ACM international conference on Information and knowledge management.   ACM, 2010, pp. 1139–1148.
  • [14] J. Gao, K. Toutanova, and W.-t. Yih, “Clickthrough-based latent semantic models for web search,” in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval.   ACM, 2011, pp. 675–684.
  • [15] C. L. Giles, K. D. Bollacker, and S. Lawrence, “Citeseer: An automatic citation indexing system,” in Proceedings of the third ACM conference on Digital libraries.   ACM, 1998, pp. 89–98.
  • [16] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava et al., “Approximate string joins in a database (almost) for free,” in VLDB, vol. 1, 2001, pp. 491–500.
  • [17] P. A. Hall and G. R. Dowling, “Approximate string matching,” CSUR, vol. 12, no. 4, pp. 381–402, 1980.
  • [18] M. A. Hernández and S. J. Stolfo, “The merge/purge problem for large databases,” in ACM Sigmod Record, vol. 24, no. 2.   ACM, 1995, pp. 127–138.
  • [19] T. Hofmann, “Probabilistic latent semantic indexing,” in ACM SIGIR Forum, vol. 51, no. 2.   ACM, 2017, pp. 211–218.
  • [20] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning deep structured semantic models for web search using clickthrough data,” in Proceedings of the 22nd ACM international conference on Conference on information & knowledge management.   ACM, 2013, pp. 2333–2338.
  • [21] M. A. Jaro, “Probabilistic linkage of large public health data files,” Statistics in medicine, vol. 14, no. 5-7, pp. 491–498, 1995.
  • [22] S. M. Kazemi, D. Buchman, K. Kersting, S. Natarajan, and D. Poole, “Relational logistic regression,” in KR, 2014.
  • [23] S. M. Kazemi, B. Fatemi, A. Kim, Z. Peng, M. R. Tora, X. Zeng, M. Dirks, and D. Poole, “Comparing aggregators for relational probabilistic models,” arXiv preprint arXiv:1707.07785, 2017.
  • [24] T. Kenter and M. De Rijke, “Short text similarity with word embeddings,” in Proceedings of the 24th ACM international on conference on information and knowledge management.   ACM, 2015, pp. 1411–1420.
  • [25] V. Levenstein, “Binary codes capable of correcting spurious insertions and deletions of ones,” Problems of Information Transmission, vol. 1, no. 1, pp. 8–17, 1965.
  • [26] B. Mitra and N. Craswell, “Neural models for information retrieval,” arXiv preprint arXiv:1705.01509, 2017.
  • [27] A. Monge and C. Elkan, “An efficient domain-independent algorithm for detecting approximately duplicate database records,” 1997.
  • [28] D. Q. Nguyen, “An overview of embedding models of entities and relationships for knowledge base completion,” arXiv preprint arXiv:1703.08098, 2017.
  • [29] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward, “Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 4, pp. 694–707, 2016.
  • [30] H. Pasula, B. Marthi, B. Milch, S. J. Russell, and I. Shpitser, “Identity uncertainty and citation matching,” in Advances in neural information processing systems, 2003, pp. 1425–1432.
  • [31] J. C. Platt, K. Toutanova, and W.-t. Yih, “Translingual document representations from discriminative projections,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.   Association for Computational Linguistics, 2010, pp. 251–261.
  • [32] S. Robertson, “Understanding inverse document frequency: on theoretical arguments for idf,” Journal of documentation, vol. 60, no. 5, pp. 503–520, 2004.
  • [33] S. Robertson, H. Zaragoza et al., “The probabilistic relevance framework: Bm25 and beyond,” Foundations and Trends® in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.
  • [34] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information processing & management, vol. 24, no. 5, pp. 513–523, 1988.
  • [35] A. Severyn and A. Moschitti, “Learning to rank short text pairs with convolutional deep neural networks,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval.   ACM, 2015, pp. 373–382.
  • [36] S. Tejada, C. A. Knoblock, and S. Minton, “Learning domain-independent string transformation weights for high accuracy object identification,” in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2002, pp. 350–359.
  • [37] W. E. Winkler, “Overview of record linkage and current research directions,” in Bureau of the Census.   Citeseer, 2006.