A survey of OpenRefine reconciliation services

06/19/2019 · Antonin Delpeuch et al.

We review the services implementing the OpenRefine reconciliation API, comparing their design to the state of the art in record linkage. Due to the design of the API, the matching scores returned by the services are of little help to guide matching decisions. This suggests possible improvements to the specifications of the API, which could improve user workflows by giving more control over the scoring mechanism to the client.


Introduction

Integrating data from sources which do not share common unique identifiers often requires matching (or reconciling, merging

) records which refer to the same entities. This problem has been extensively studied and many heuristics have been proposed to tackle it 

[7].

The OpenRefine reconciliation API (https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API) is a web protocol designed for this task, which was initially implemented by the Freebase knowledge base. While most software packages for record linkage assume that the entire data is available locally, and can be indexed and queried at will, this API proposes a workflow for the case where one of the data sources to be matched is held in an online database. By implementing such an interface, the database lets users match their own datasets (which are typically smaller in size) to the identifiers it holds.

As entity matching often relies on names, the reconciliation API is essentially a search API tailored to the reconciliation problem. A typical query to a reconciliation interface consists of the name of the entity to search for, an entity type to restrict the search to a certain category of entities and a couple of other attributes to refine the search by field values. The service responds by returning matching candidates with their identifiers.

The canonical client for this API is OpenRefine (http://openrefine.org/) [13], an Extract-Transform-Load tool which can be used to transform raw tabular data into linked data. The tool proposes a semi-automatic approach to reconciliation, making it possible for the user to review the quality of the reconciliation candidates returned by the service. To that end, the reconciliation API lets services expose auto-complete endpoints and HTML previews for the entities they store, easing integration in the user interface of the client.

In this survey, we review the current ecosystem of reconciliation services. We analyze how they use the various features of the reconciliation API, review their underlying implementation when available, and propose possible changes to the protocol, making it more useful to data providers and consumers.

Acknowledgements

This work was supported by OpenCorporates as part of the “TheyBuyForYou” project on EU procurement data. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n°780247).

1 Overview of reconciliation

We first explain what reconciliation means and how the OpenRefine reconciliation API can be used for this process.

1.1 Goals and scope

Reconciliation consists in establishing a mapping between two sets of entities:

  • the entries of a dataset D provided by the user. Such a dataset would typically contain a few hundred or a few thousand entries. For instance, the dataset could list procurement contracts between some administration and its suppliers, a list of endangered plants in a national park, or an inventory of coins found on an archeological site.

  • the records of an authoritative online database R. This database is typically larger and considered more reliable than the user dataset, and generally contains unique identifiers for its entities. For instance, OpenCorporates lists about 165 million companies harvested from company registers, with their official identifiers; the International Plant Names Index stores canonical plant names and the associated scholarly information; and Nomisma curates linked open data around numismatics.

The goal of reconciliation is to guess a partial function m from D to R, mapping each entry of the user dataset to the record of R that represents the same entity, if any. This matching process is bipartite: it is assumed that both D and R are free from duplicates, so records coming from the same database are not compared to each other. The function m is partial: it is possible that a user record does not correspond to any reference entry in R. Each user record corresponds to at most one entity in R.

The two databases D and R generally contain different fields. For reconciliation to be possible, we assume that some of these fields are shared by both databases. For instance, it is often the case that both databases contain names for the entities they refer to. The mapping m is then constructed by comparing the values of these common attributes. These values generally differ slightly between the two databases, and are often ambiguous or incomplete, which is why heuristics have to be used to construct m.

The motivations for reconciliation vary. One can reconcile to make the user dataset more canonical, by normalizing its references to the entities so that they match the authoritative data. One can also enrich the user data with additional identifiers and other attributes retrieved from the target database. Finally, it is also possible to use reconciliation as part of the curation process of the authoritative database, for instance to push data from D into R.

1.2 The OpenRefine reconciliation API

In this section we give an overview of the specifications of the reconciliation API. We are not aware of any formal specification for it, but an informal description can be found at https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API. We start by formalizing the data model projected by the API on the reconcilable data source.

Data model

The data source is assumed to have a set of entities E, a set of possible entity types T and a set of possible properties P. Types provide a way to categorize entities, and properties are predicates that can be applied to entities. We will also denote by S the set of character strings.

Each entity e ∈ E has a set of types types(e) ⊆ T, possibly empty. For each property p ∈ P and entity e, there is a set of values p(e) ⊆ S ∪ E, possibly empty. A value can either be a character string or another entity.

For instance, if the data source is structured as a relational database, then types could be tables, entities could be table rows and properties could be table columns. Each entity would have exactly one type (the table the row belongs to), and p(e) would either be empty (if the row e and the column p do not belong to the same table) or a singleton (the value at the intersection of the row and the column).

If the data source is structured as a graph database, then entities could be nodes, types could be some sort of category system on nodes, and properties could be graph edges.

Each entity, type and property is designated by an identifier in S. It also has a human-readable name in S.
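
As a rough illustration, this data model can be pictured with the following sketch in Python (the class names and fields below are our own notation, not mandated by the API):

    from dataclasses import dataclass, field
    from typing import Dict, List, Union

    @dataclass
    class Type:
        id: str        # stable identifier, e.g. "company"
        name: str      # human-readable label

    @dataclass
    class Property:
        id: str
        name: str

    @dataclass
    class Entity:
        id: str
        name: str
        types: List[Type] = field(default_factory=list)   # possibly empty
        # values of each property: character strings or other entities, possibly empty
        values: Dict[str, List[Union[str, "Entity"]]] = field(default_factory=dict)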

Reconciliation queries

A reconciliation query is given by:

  • a query string s ∈ S, representing the name of the entity queried for;

  • an optional set of types to restrict the search to;

  • a set of property values, possibly empty, giving additional information about the entity queried for.

The main task of a reconciliation service is to process reconciliation queries, sent over HTTP, and return a set of matching candidates for each query.

A reconciliation candidate is given by:

  • an entity e ∈ E, serialized with its id, name and types;

  • a matching score. The definition of this score is left to the service, but it is expected that the higher the score, the better the candidate matches the query.

If we denote by Q the set of reconciliation queries and by C the set of reconciliation candidates, the task of a reconciliation service is therefore to compute a function from Q to the finite subsets of C.
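
As an illustration, the sketch below sends a one-query batch to a hypothetical endpoint, roughly following the informal wiki specification cited above (the URL, the property identifier and the exact field names are assumptions to be checked against the specification):

    import json
    import requests   # third-party HTTP library

    SERVICE_URL = "https://example.org/reconcile"   # hypothetical endpoint

    # A batch containing one reconciliation query
    batch = {
        "q0": {
            "query": "Greentech Distribution",                  # query string
            "type": "company",                                  # optional type restriction
            "properties": [{"pid": "jurisdiction", "v": "gb"}]  # additional field values
        }
    }
    response = requests.post(SERVICE_URL, data={"queries": json.dumps(batch)}).json()

    # Expected shape of the answer: one candidate list per query key
    for candidate in response["q0"]["result"]:
        print(candidate["id"], candidate["name"], candidate["score"])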

Suggest, preview and extend services

In addition to the main querying method, the reconciliation API also specifies ways for the service to expose suggest (or auto-complete) services, HTML previews of entities and bulk data retrieval, which ease the integration of the service in OpenRefine’s user interface.

2 Analysis of existing reconciliation services

In this section, we survey the reconciliation services currently accessible online, and outline the main technical choices behind them. We compare them to the general techniques found in the literature on record linkage. Although there are many different approaches to record linkage, most of them broadly follow a common architecture [11, 7]. Figure 1 provides a summary of the characteristics of the reconciliation services studied.

Candidate retrieval

First, potential matches from the target database are selected. It is generally assumed that the user data contains a name for the entity. In the case of companies, this name would ideally be the official company name, but could also be an acronym, a trademark or any other name under which the company is or was informally known. This name is generally the primary discriminative information at this stage, if not the only one. A short list of candidates with similar names is retrieved from the target database, generally using a search index. The approaches for this step are reviewed in Section 2.1.

Field scoring

Second, each field supplied by the user is compared against the corresponding values in the target database. The degree of similarity of each pair of values is usually represented by a boolean or numeric score. The nature of the scoring method depends largely on the type of information stored in each field. In Section 2.2, we survey field scoring techniques and their use in reconciliation services.

Matching decision

Third, the field scores are used to determine which of the candidates (if any) will be retained as the matching entity. This critical step is often broken down into two tasks. First, all the field-level scores are aggregated into one global matching score for each candidate in the short list; the crucial decision at this stage is how to balance the influence of each field on the final matching score. In Section 2.3 we give an overview of the wide range of approaches that have been proposed to determine these weights. Second, once this global score is defined, the final decision of whether to match a candidate to the user data is generally based on a threshold on the global score, determined by the risk associated with false positives and false negatives.

| Name | Types | Properties | Retrieval | Name score | Property score | Global score |
|---|---|---|---|---|---|---|
| OpenCorporates | single (company) | jurisdiction, date | ElasticSearch and SQL LIKE on name | Combination of domain-specific rules and Levenshtein distance | Boolean on jurisdiction, fixed penalties on date | Weighted sum of field scores |
| IPNI | single (scientific name) | 19 properties | Lucene index | Boolean matchers configured for each name and property, with canonicalization | (see name score) | Average of field scores adjusted for blank fields |
| FindThatCharity.uk | single (charity) | None | ElasticSearch | N/A | N/A | ElasticSearch's score |
| Nomisma | 23 types | 8 properties | Solr | N/A | N/A | Solr's score |
| VIAF | 5 types | None | VIAF search API | Levenshtein distance ratio | N/A | N/A (name score only) |
| OpenLibrary | single (book) | Unspecified | OpenLibrary search API, concatenating property values to name | N/A | N/A | Constant (1.0) |
| ORCID | single (person) | None | ORCID search API | N/A | N/A | "relevancy" score returned by the search API |
| Wikidata | Wikidata items used with instance of (P31) and subclass of (P279) | All Wikidata properties (a few thousand) | Wikidata search APIs | Levenshtein-based fuzzy metric | Defined by property datatype | Weighted average of field scores |
| lobid-gnd | 8 types | All properties from the GND ontology | ElasticSearch, concatenating property values to name | N/A | N/A | ElasticSearch's score |
| GODOT | single (person) | None | Direct comparison to all records | Levenshtein-based fuzzy metric | N/A | N/A (name score only) |
| OCCRP | 8 types | defined by the schema | ElasticSearch | N/A | N/A | ElasticSearch's score |
Figure 1: Main characteristics of the surveyed reconciliation services

2.1 Candidate retrieval

The choice of restricting the matching heuristics to a short list of candidates is a simplification to reduce the computational cost of reconciliation. Instead of comparing the user record to each database entry, these comparisons are restricted to the most relevant entries, fetched by a coarse-grained but computationally efficient filtering method.

The usual way to perform this filtering is to maintain an index on one or more fields of the database. This indexing is often called blocking [8, 7]: the records are partitioned into blocks (or buckets) depending on their values. Given a query, we compute the corresponding block (or blocks) and only retrieve candidates from there. For instance, indexing names with a phonetic transcription such as Soundex will map the names "Will" and "Wil" to the same code W400.
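
As a minimal sketch of this idea, the simplified Soundex encoder below (sufficient for the Will/Wil example above) can be used as a blocking key; it is an illustration, not the code used by any of the surveyed services:

    from collections import defaultdict

    def soundex(name: str) -> str:
        """Simplified Soundex: initial letter followed by up to three digit codes."""
        groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
        code = {ch: d for letters, d in groups.items() for ch in letters}
        name = name.upper()
        result, prev = name[0], code.get(name[0], "")
        for ch in name[1:]:
            d = code.get(ch, "")
            if d and d != prev:
                result += d
            if ch not in "HW":      # H and W do not reset the previous code
                prev = d
        return (result + "000")[:4]

    # Blocking: records are grouped by the code of their name, so "Will" and "Wil"
    # (both W400) end up in the same block and are the only ones compared to each other.
    blocks = defaultdict(list)
    for record in ({"name": "Will"}, {"name": "Wil"}, {"name": "Mary"}):
        blocks[soundex(record["name"])].append(record)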

Common solutions involve building an inverted index which can be used to retrieve all candidates containing words of the query [1]. Various techniques have been developed to make this retrieval more error-tolerant: for instance, indexing based on q-grams (sequences of q consecutive characters) instead of words makes it possible to retrieve misspelt records [4, 7].

In reconciliation services

Existing services overwhelmingly rely on traditional search engines for candidate retrieval, and a Lucene-based index is the most common choice (both ElasticSearch and Solr rely on Lucene). This holds both for services hosted by the original data provider, which can query their own search engine directly, and for services implemented by third parties on top of the generic API exposed by the data provider.

Many of the advanced indexing techniques and linguistic preprocessing steps mentioned in the literature are available in Lucene. In this context, improving the candidate retrieval step consists in tuning the configuration of the indices to the type of data they are used for. It is an area worth investing effort in, as any improvement benefits not only the reconciliation service but also all the other services relying on the search engine (such as any search UI offered to end users).

The only exceptions to this are the GODOT reconciliation service, where no candidate retrieval phase is performed (all records are compared to the query), and OpenCorporates, where some queries rely on SQL search.

2.2 Individual field scoring

In this section, we review various scoring methods for individual fields.

In the absence of unique identifiers, the name of an entity is the primary discriminative clue to identify it. Therefore, scoring methods for entity names have attracted a lot of attention. They fall into three families depending on which basic comparison unit they use: characters, q-grams or words.

Character-based metrics quantify the minimal number of operations on individual characters needed to transform a string into the string it is compared to [7]. The nature and cost of the operations involved depend on the algorithm. The simplest version is called the edit distance: the allowed operations consist in deleting, inserting or replacing characters. Although the search space of editing operations is large, it is possible to compute this distance quickly with the standard Levenshtein (dynamic programming) algorithm, whose running time is proportional to the product of the lengths of the strings compared. Many variants of this metric have been introduced: adding operations that modify larger groups of consecutive characters (for instance to soften the effect of a missing or shortened word on the score), giving different weights depending on the characters substituted (to account for replacements of similar characters such as O and 0) [15], or speeding up the comparison by restricting the number of changes [14].
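
For reference, a minimal dynamic-programming implementation of the plain edit distance looks like this (a sketch, not the code of any surveyed service):

    def levenshtein(a: str, b: str) -> int:
        """Plain edit distance; runs in time proportional to len(a) * len(b)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    # levenshtein("Greentech", "Greentec") == 1; a common normalized similarity
    # is 1 - distance / max(len(a), len(b)).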

Q-gram based metrics extract all the sequences of q consecutive characters in each string and compare them. For instance, the word "Oracle" contains the 3-grams "Ora", "rac", "acl" and "cle". Although not as precise as an edit distance, comparing the sets of q-grams contained in two strings is a simple and inexpensive way to assess the extent to which they differ. It accounts for word reorderings and can also be used for indexing. Like character-based metrics, q-grams are mostly useful when the strings differ by spelling mistakes or encoding errors.
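
A sketch of a q-gram comparison, using Jaccard similarity over trigram sets as one common way to turn the overlap into a score:

    def qgrams(s: str, q: int = 3) -> set:
        """All q-grams of a string, e.g. qgrams("Oracle") == {"ora", "rac", "acl", "cle"}."""
        s = s.lower()
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def qgram_similarity(a: str, b: str, q: int = 3) -> float:
        """Jaccard similarity between the two q-gram sets."""
        ga, gb = qgrams(a, q), qgrams(b, q)
        return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

    # qgram_similarity("Oracle", "Oracel") stays well above zero despite the typo.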

Word-based (or token-based) distances first separate the input into words and use these as basic comparison units. Working at word level gives a more semantic notion of similarity, without conflating words with similar spellings but unrelated meanings. It is still possible to add stemming and other normalization procedures to each word to account for some controlled variation in word spellings.

For both q-gram and word-based approaches, there are various methods to turn a set of common units into a score. The simplest is to count the number of matching units and divide it by the total number of units in both strings. However, not all tokens in a name are equally informative. For instance, the similarity between “Greentech Distribution” and “Greentech Services” should be higher than that between “Greentech Distribution” and “Globafrik Distribution”, simply because having “Greentech” in common is more informative than having “Distribution” in common.

The standard solution to this problem is called TF-IDF (Term Frequency - Inverse Document Frequency). Informally, this is a method to measure the significance of a word occurrence in some text. The significance is proportional to two factors: how often the term appears in the given document and how rarely it appears in other documents. In the context of name matching, the documents are very short as they are the names themselves, so term frequency does not play an important role. However, inverse document frequency is a decisive factor which will give more significance to “Greentech” than to “Distribution”. SoftTFIDF [10] is a method to use TF-IDF as a string similarity measure. In its simplest version, it is simply defined as

sim(s, t) = Σ_{w ∈ s ∩ t} V(w, s) · V(w, t), where s and t are the strings to compare, w ranges over the words common to s and t, and V(w, s) is the TF-IDF weight of w in s. In its full version, SoftTFIDF also allows for some dissimilarity between words by incorporating a word similarity metric in each summand.
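
The effect can be reproduced with a small sketch (the toy corpus of names below is ours, and the weights are L2-normalized before the sum, one common convention):

    import math
    from collections import Counter

    corpus = ["Greentech Distribution", "Greentech Services", "Globafrik Distribution",
              "Acme Distribution", "Blue Ocean Distribution"]   # toy reference names
    N = len(corpus)
    df = Counter(w for name in corpus for w in set(name.lower().split()))

    def weights(s):
        tf = Counter(s.lower().split())
        w = {t: tf[t] * math.log(N / df.get(t, 1)) for t in tf}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        return {t: v / norm for t, v in w.items()}

    def tfidf_similarity(s, t):
        ws, wt = weights(s), weights(t)
        return sum(ws[w] * wt[w] for w in ws.keys() & wt.keys())

    # Sharing the rare word "Greentech" scores higher than sharing "Distribution":
    # tfidf_similarity("Greentech Distribution", "Greentech Services")
    #   > tfidf_similarity("Greentech Distribution", "Globafrik Distribution")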

In reconciliation services

Many reconciliation services skip field scoring altogether by returning a global score computed by the search engine used for candidate retrieval. The reconciliation services that do compute a matching score for the name or other textual fields generally use Levenshtein-based metrics or more conservative exact matching.

Advanced heuristics such as SoftTFIDF could help surface more relevant candidates. For services hosted by the data provider itself and run on top of a search engine, it is likely that TF-IDF scores can be obtained along with the candidates (for instance with ElasticSearch’s explain=true mode).

2.3 Global scoring methods

Once fields from the reference database have been compared with the user data, we need to draw on these comparisons to decide whether to match the user record to a reconciliation candidate. Users have various expectations about this step and it is crucial to accommodate them.

First, reconciliation is an inherently approximate process and the accuracy to aim for depends on the application: the cost of false positives (erroneously matching a user record to a reference identifier) and false negatives (erroneously declaring that the user record does not correspond to any reference identifier) varies. Many record linkage methods let the user influence these error rates by computing a global matching score. The user can then set their own threshold on the matching score to get the desired trade-off between false positives and false negatives. However, in the absence of reference data to evaluate these error rates, the impact of the threshold on errors is often unknown.

Second, the notion of identity between the user data and the reference database is not always the same. For instance, when reconciling companies from a list of bids for a market, users might want to match each company to the exact legal entity that submitted the bid, or to a better-known, larger entity controlling the bidder. This means that the relative influence of fields such as the headquarters’ location might also need to vary. Giving the user some control over the global scoring method is also important to let them factor in the reliability of their data in each field.

Given a collection of features comparing a user record to a reference record, there are various ways to build a decision function which predicts whether the records refer to the same entity.

Linear models

Features are often boolean or numeric, and the simplest way to combine them into one score is a linear combination of the features. The higher this weighted sum gets, the more confident the system is that the two records represent the same entity. Many probabilistic approaches to record linkage, such as that of Fellegi and Sunter [12], also fall into this category: the score corresponds to a probability of match and is log-linear in the feature values. The decision of whether two records are considered matches or mismatches is then taken by comparing the confidence score to thresholds. In probabilistic models these thresholds can be determined by desired false-positive and false-negative rates [11].
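
A minimal sketch of such an aggregation, with two thresholds separating matches, cases to review and non-matches (the feature names, weights and thresholds below are purely illustrative):

    def global_score(features: dict, weights: dict) -> float:
        """Weighted sum of field-level scores."""
        return sum(weights[f] * features[f] for f in weights)

    features = {"name_similarity": 0.92, "jurisdiction_match": 1.0, "date_penalty": -0.1}
    weights = {"name_similarity": 0.7, "jurisdiction_match": 0.25, "date_penalty": 1.0}

    score = global_score(features, weights)
    UPPER, LOWER = 0.8, 0.5   # thresholds chosen from acceptable error rates
    decision = "match" if score >= UPPER else ("review" if score >= LOWER else "non-match")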

Decision trees

Decision trees define simple decision procedures for deciding whether two records match, without defining a global matching score [9]. Starting from the root of the tree, the decision process visits nodes. Each internal node is associated with one feature and a threshold to compare it against. The comparison determines which node to visit next, and the process terminates when a leaf is reached, which contains a matching decision. One important aspect of decision trees is that they are easy for users to define and interpret.
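
Such a tree can be read as a short sequence of tests, for instance (a hand-written sketch with illustrative features and thresholds):

    def decide(features: dict) -> str:
        """A tiny hand-written decision tree over field-level scores."""
        if features["name_similarity"] >= 0.9:
            return "match" if features["jurisdiction_match"] else "review"
        if features["name_similarity"] >= 0.6 and features["property_score"] >= 0.8:
            return "match"
        return "non-match"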

Other classifiers

Deciding whether two records refer to the same entity is a binary classification problem, so many other classes of decision functions can be used to tackle it. Generic machine learning tools such as Support Vector Machines or k-nearest neighbours have been used in this context [6, 7].

In reconciliation services

Again, many reconciliation services avoid developing their own scoring mechanism by simply exposing the score returned by the underlying search engine. When the service is run by the data provider itself, the configuration of the search index can be adjusted to make this score more useful.

When scoring is done explicitly in the service, linear models are the most widespread choice, due to their simplicity and their ability to aggregate evidence from various features. However, given the partial view reconciliation services have of the user data, a probabilistic approach is difficult, making it hard to set weights and thresholds in a principled way.

When matching or unmatching sets of rows selected by facets, OpenRefine users are effectively building an implicit decision tree in their operations history. However, given that the field matching scores are not exposed to the user, this work often involves re-computing similar features locally (such as edit distances between labels). See Section 3.2 for proposals to improve this.

We are not aware of any use of advanced machine learning techniques in combination with OpenRefine or its reconciliation API. The limiting factor for this is again the unavailability of field matching scores, which we also propose to solve in Section 3.2.

3 Improving reconciliation workflows

OpenRefine reconciliation is designed to solve a particular form of record linkage problem. It was originally designed to work with Freebase, a collaborative knowledge graph. In this context, users would align datasets that they wanted to upload to Freebase by matching the entities in their table to existing Freebase topics, so that the information they uploaded built on previous contributions, improving existing topics and creating new ones. The reconciliation process has since been generalized to work with arbitrary target databases, by specifying a dedicated web API that the database must expose [2].

In this section, we describe what the current reconciliation workflow looks like, what its limitations are, and how it could evolve to better accommodate users’ needs.

3.1 Current reconciliation workflow

OpenRefine lets users link their tables to target databases such as OpenCorporates. This works by selecting a column containing the names of the entities to match and configuring the reconciliation process as shown in Figure 2.

Figure 2: OpenRefine’s user interface for reconciliation configuration

Users can choose to restrict the reconciliation candidates to records of a particular type. This notion of type is defined by the target database: each record it holds can have multiple types, each of which is defined by an identifier and accompanied by a human-readable name. Beyond the column containing names, it is possible to use other columns of the table by matching them to fields of the target database. To this end, the target database must expose a vocabulary which lists the fields that user data can be matched against.

Once reconciliation has been configured, OpenRefine will make a series of API calls to the reconciliation service, each call containing a small batch of reconciliation queries. A reconciliation query consists of the values in the columns used for reconciliation (main column for the name and auxiliary columns for other fields) as well as the chosen type to restrict reconciliation candidates to (if any). For each reconciliation query, the reconciliation service returns a list of candidates. Each candidate is supplied with a unique identifier from the database, a human-readable name, a list of types and a matching score.

This matching score is produced by the reconciliation service on the basis of the information supplied in the query and is typically opaque: users do not necessarily know how scoring works. In particular, users have no easy way to influence the importance of a given field, or to inspect the matching scores of individual fields.

By using facets, it is then possible to take matching decisions for rows matching certain criteria. These criteria can depend on matching scores, types of the candidates, or any other value in the table.

3.2 Exposing field-level scores in the reconciliation API

The main limitation of this workflow is the lack of control over the scoring mechanism. As a user, it is hard to rely on an opaque score to build a reliable reconciliation workflow. Even if the scoring function is publicly documented, it might not be suitable for all datasets. As a reconciliation service provider, coming up with a scoring mechanism that works for everyone is impossible, especially because the final matching decisions made by users are not communicated back to the reconciliation service: it is impossible to learn the scoring function from data unless a dedicated dataset is annotated separately. Such a dataset is hard to compile given the wide variety of use cases and user data sources that reconciliation endpoints are typically exposed to.

Figure 3: Five possible boundaries of responsibilities between server and client in a reconciliation process. The pipeline stages shown are candidate retrieval, field scoring, global scoring and matching decision; the five boundaries correspond to (1) offline matching, (2) manual matching via a search API, (3) the proposed new reconciliation API, (4) the current OpenRefine reconciliation API and (5) server-side dataset matching.

To solve this problem, we need to shift the boundary between the responsibilities of the service provider and the user in the reconciliation process. Figure 3 shows a diagrammatic representation of the reconciliation process, with various options as to where the reconciliation API should sit. Dashed lines represent the separation of responsibilities between client on the right (the user who supplies the data to match) and the server on the left (the reconciliation service which exposes the database to be matched against) in various scenarios. Each of these scenarios has important implications in terms of usability, performance and quality that we analyze below.

1 Offline matching

This consists in downloading a copy of the target database and performing the reconciliation process locally. It is generally necessary to first build indices on the database, transform it to a different format, and write some custom matching heuristics. Off-the-shelf record linkage tools such as Duke (https://github.com/larsga/Duke), the R package RecordLinkage [17] or Serf [5] can also be used. This workflow can be necessary when the user dataset to be matched is large, as it minimizes data transfer between the user and the database. However, for users with small datasets or simple matching needs, this workflow is completely impractical when the target database is large, as in the case of OpenCorporates.

2 Manual matching via search API

This workflow is fairly widespread, as many online databases offer a web API that can be used to search for records using various criteria. It is then up to the user to decide how to compare the records returned to their own data. If the API exposed by the service is flexible enough to retrieve the appropriate candidates efficiently, this can be viable, but a custom reconciliation process must be implemented by the user, which is a significant investment. The API also often hides valuable statistics from the search engine, such as those needed to compute TF-IDF scores.

3 Proposed new reconciliation API

We propose to improve the existing reconciliation API used in OpenRefine to let data providers expose matching scores for individual fields, instead of just one global score. This would let clients use their own global scoring methods, which would give the appropriate weight to each field. Handling field scoring server-side would let the reconciliation services implement metrics that are meaningful in their domain (such as the bespoke name matching heuristics used in OpenCorporates for company names) without having to implement this domain-specific expert knowledge in generic client-side tools. In this configuration, field-level scores can also depend on statistics maintained in the search engine of the database, making it possible to use TF-IDF scores for instance. We explore the implications of this proposal in Section 3.3.
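
One possible shape of such an extended candidate is sketched below (the field names and the feature list are purely illustrative; the concrete format would have to be agreed with the community, see Section 3.3):

    candidate = {
        "id": "gb/01234567",                       # hypothetical identifier
        "name": "Greentech Distribution Ltd",
        "type": [{"id": "company", "name": "Company"}],
        "score": 87.3,                             # global score kept for compatibility
        "features": {                              # proposed: field-level scores
            "name_tfidf": 0.81,
            "name_levenshtein": 0.92,
            "jurisdiction_match": 1.0,
        },
        "match": False,
    }

    # A client can then apply its own aggregation instead of relying on "score":
    weights = {"name_tfidf": 0.6, "name_levenshtein": 0.2, "jurisdiction_match": 0.2}
    client_score = sum(weights[f] * candidate["features"][f] for f in weights)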

4 Current OpenRefine reconciliation API

As explained in Section 3.1, global candidate scoring is currently the responsibility of the reconciliation service, making it impossible for users to influence how this score is computed. Another way to solve this problem would be to let the user specify more parameters in their reconciliation queries, such as a numerical weight for each of the columns supplied. The main downside of such an approach is that it would be hard for the user to come up with these weights initially, and changing the weights a posteriori would require running the reconciliation process again (which costs time and resources). This would make it hard to integrate the API with any machine learning approach.

5 Server-side dataset matching

In this scenario, the user would upload their dataset to the reconciliation service, which would perform the matching of all rows in one go and return the final results. This would have the advantage of eliminating round-trips between the client and server, but would make it hard for the user to finely tune the reconciliation heuristics. Providing reference matching decisions to the reconciliation service would also be hard as the reconciliation candidates would not be known in advance.

3.3 Evolution of the reconciliation API

As motivated by our analysis of the various scenarios above, we propose changes to the reconciliation API and evaluate the impact on service providers, API clients and end users.

For a service provider, the proposed change would imply changing the format of the responses returned to include the matching scores for each field. The specification of the API could potentially allow for multiple scores per field, which would let services expose different scoring heuristics. The scores returned could also be independent of the fields supplied: the API would simply require services to return a list of feature values. In order to remain compatible with existing clients, it might be useful to require the services to still return a global score as well. This global score would serve as a default and could be ignored by clients which rely on the individual features instead. These details and the concrete format of the responses should be put up for consultation with the community, to ensure that the initiative is followed by as many stakeholders as possible.

For an API client, such as OpenRefine itself, this proposal would imply some changes, at the very least to store and expose the feature scores. OpenRefine already has a dedicated field to store features associated with a particular reconciliation candidate, but these features are computed locally and are therefore very generic and not easily adaptable to particular domains. More importantly, clients need to include tools to help users build appropriate decision functions for their dataset. This could be achieved by integrating machine learning packages developed in other tools, which would give the user real control over the error rates and abstract away the features. The existing manual matching capabilities could be reused to provide training data to these automated approaches. Active learning has been applied to record linkage to learn classifiers with small quantities of training data [18, 16]. Active learning works by incrementally improving a classifier with new training examples, selected from the cases where the classifier is least confident. This learning paradigm can be used with a wide range of classifier types [3] and could be an interesting complement to the exploratory data analysis workflows encouraged by OpenRefine’s design.
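
As a sketch of what such an integration could look like, the loop below performs uncertainty sampling with a generic classifier (scikit-learn is used for illustration only; the feature vectors, labels and the way the user is asked for decisions are all assumptions, not part of OpenRefine):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Feature vectors for candidate pairs: [name similarity, type match, property score]
    pool = np.array([[0.95, 1, 0.8], [0.40, 0, 0.1], [0.70, 1, 0.3],
                     [0.55, 1, 0.9], [0.20, 0, 0.0], [0.85, 0, 0.6]])
    labeled_X = np.array([[0.98, 1, 0.9], [0.10, 0, 0.0]])
    labeled_y = np.array([1, 0])                 # 1 = match, 0 = non-match

    for _ in range(3):                           # a few active-learning rounds
        clf = LogisticRegression().fit(labeled_X, labeled_y)
        proba = clf.predict_proba(pool)[:, 1]
        i = int(np.argmin(np.abs(proba - 0.5)))  # candidate the model is least sure about
        label = int(input(f"Is pair {pool[i]} a match? (0/1) "))   # ask the user
        labeled_X = np.vstack([labeled_X, pool[i]])
        labeled_y = np.append(labeled_y, label)
        pool = np.delete(pool, i, axis=0)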

For users, the reconciliation process must remain accessible and simple. It should be possible to work with a stock global scoring method whose performance should be comparable to the current scores. Exploring the values of these features should be possible with facets, and features should be documented so that users can understand the meaning of reconciliation results. Finally, in a scenario where machine learning is used, it should ideally be possible to expose the learned decision function to the user, for instance as a decision tree or by drawing the decision boundary on a scatterplot. It could also be useful to let the user interact with this learned model and tune it with their own knowledge of the data.

Conclusion

We have surveyed the existing reconciliation services and compared them to the state of the art in record linkage. From this review, we suggest possible changes to the reconciliation API. We propose to make it possible for reconciliation services to expose field matching scores in addition to (or instead of) a global matching score for each candidate. The initial motivation for this change is to make it possible for users to balance the importance of each field, but the implications are much broader, as this change would make it possible to reuse a wide range of advanced classifiers from the literature. With the appropriate integration in OpenRefine, this would help users build reliable matching heuristics, informed by their expert knowledge of the data. This change would also benefit any other API user, who could feed these features to the machine learning packages of their choice.

References

  • [1] Elasticsearch from the Bottom Up, Part 1. https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up, September 2013.
  • [2] Reconciliation Service API. https://github.com/OpenRefine/OpenRefine, November 2018.
  • [3] Arvind Arasu, Michaela Götz, and Raghav Kaushik. On active learning of record matching packages. In Proceedings of the 2010 International Conference on Management of Data - SIGMOD ’10, page 783, Indianapolis, Indiana, USA, 2010. ACM Press.
  • [4] Rohan Baxter, Peter Christen, and Tim Churches. A Comparison of Fast Blocking Methods for Record Linkage. page 6, 2003.
  • [5] Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. Swoosh: A generic approach to entity resolution. The VLDB Journal, 18(1):255–276, January 2009.
  • [6] Peter Christen. Automatic record linkage using seeded nearest neighbour and support vector machine classification. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 08, page 151, Las Vegas, Nevada, USA, 2008. ACM Press.
  • [7] Peter Christen. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer Science & Business Media, 2012.
  • [8] Peter Christen. A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9):1537–1555, September 2012.
  • [9] Munir Cochinwala, Verghese Kurien, Gail Lalk, and Dennis Shasha. Efficient data reconciliation. Information Sciences, 137(1):1–15, September 2001.
  • [10] William W Cohen, Pradeep Ravikumar, and Stephen E Fienberg. A Comparison of String Metrics for Matching Names and Records. page 6, 2003.
  • [11] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, January 2007.
  • [12] Ivan P Fellegi and Alan B Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
  • [13] David Huynh, Tom Morris, Stefano Mazzocchi, Iain Sproat, Martin Magdinier, Thad Guidry, Jesus M. Castagnetto, James Home, Cora Johnson-Roberson, Will Moffat, Pablo Moyano, David Leoni, Peilonghui, Rudy Alvarez, Vishal Talwar, Scott Wiedemann, Mateja Verlic, Antonin Delpeuch, Shixiong Zhu, Charles Pritchard, Ankit Sardesai, Gideon Thomas, Daniel Berthereau, and Andreas Kohn. OpenRefine. 2019.
  • [14] Gad M Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. Journal of Algorithms, 10(2):157–169, June 1989.
  • [15] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, March 1970.
  • [16] Sunita Sarawagi and Anuradha Bhamidipaty. Interactive Deduplication using Active Learning. page 10, 2002.
  • [17] Murat Sariyar and Andreas Borg. The RecordLinkage Package: Detecting Errors in Data. 2:7, 2010.
  • [18] Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learning domain-independent string transformation weights for high accuracy object identification. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’02, page 350, Edmonton, Alberta, Canada, 2002. ACM Press.