Fast Record Linkage for Company Entities

07/19/2019 · Thomas Gschwind et al. (IBM)

Record Linkage is an essential part of almost all real-world systems that consume data coming from different sources, structured and unstructured. Typically no common key is available to connect the records, and massive data cleaning and data integration processes often have to be completed before any data analytics and further processing can be performed. Though record linkage is often seen as a somewhat tedious but necessary step, it is able to reveal valuable insights into the data at hand. These insights guide further analytic approaches over the data and support data visualization. In this work we focus on company entity matching, where company name, location and industry are taken into account. The matching is done on the fly to accommodate real-time processing of streamed data. Our contribution is a system that uses rule-based matching algorithms for scoring operations, which we extend with a machine learning approach to account for short company names. We propose an end-to-end, highly scalable, enterprise-grade system. Linkage time is greatly reduced by an efficient decomposition of the search space using MinHash. High linkage accuracy is reached by a thorough scoring process over the matching candidates. Based on two real-world ground truth datasets, we show that our approach reaches a recall of 91%. These results are achieved while scaling linearly with the number of nodes used in the system.

1 Introduction

Enterprise artificial intelligence applications require the integration of many data sources as well as the ability to link data among these data sources. In such applications, one of the most important entity attributes to be linked across the data sources is the company name. It acts as a “primary key” across multiple datasets such as company descriptions, marketing intelligence databases, ledger databases, or stock market indexes. The technique used to perform such a linkage is commonly referred to as record linkage or entity matching.

Record Linkage (RL) has been extensively studied over the last decades. RL is in charge of joining various representations of the same entity (e.g., a company, an organization, a product, etc.) residing in structured records coming from different datasets. It was formalized by Fellegi and Sunter in 1969 [13], and the tutorial by Lise Getoor [14] provides an excellent overview of the use cases and techniques. Essentially, RL has been used to link entities across different sets or to deduplicate/canonicalize entities within a given set. To this end, several approaches have been envisaged, ranging from feature-matching and rule-based methods to machine learning approaches.

Typically, RL is performed in batch mode to link a large number of entities between two or more databases [14]. A challenge in enterprise applications is the ever-increasing amount of unstructured data like news, blogs, social media content, etc., and the need to integrate it with enterprise data. As a consequence, RL has to be performed between structured records and unstructured documents. This large amount of data may flow in streams for rapid consumption and analysis by enterprise systems. For example, if we consider the analysis of news articles, it is not uncommon to have to process about one million news articles per day. Even if we consider only one company name per article, we have to perform at least ten linkages per second in order to preserve the timeliness of the overall application. Therefore, RL needs to be executed on the fly and with stringent time constraints. Moreover, “real-time” RL is also required in other use cases such as auto-completion or input suggestions in user interfaces.

In this work we assume that the entities to be linked in unstructured data are identified as mentions by a ready-to-use Named Entity Recognition (NER) module such as [1]. These entities are then passed to RL in a structured fashion in the form of a record containing attributes that, for example, represent company names, locations, industries and others. RL is in charge of linking this record against one or multiple reference datasets. We focus on rule-based RL as we have only very limited training datasets, which would be required to investigate the use of machine learning techniques in this context.

Reference datasets are typically relatively large and may contain several tens (or even hundreds) of millions of records. When matching an incoming record, it would be prohibitive to perform an extensive comparison against each individual entry in the reference dataset. The most commonly used approach [14] is to decompose the reference dataset into blocks by means of a Locality Sensitive Hashing (LSH) function. Each block is the set of records having the same hash value, obtained by applying the hashing function to one or more attributes of the record.

The main contributions of this work are:

  1. We present an end-to-end comprehensive RL system which is highly scalable and provides an enterprise-grade RL service.

  2. We study scoring functions for various attribute types and, in particular, we propose a scoring technique for company names taking into account their various properties. Also, we propose a hierarchical scoring tree which allows the efficient and flexible implementation of multi-criteria scoring functions.

  3. We study and apply automatic short company name extraction based on Conditional Random Fields as a sequential classification method. Short names are treated as one of the important features of a company entity.

  4. We evaluate different LSH configurations, as well as our system based on two real world ground truth datasets.

The remainder of this paper is organized as follows. Section 2 discusses the general background of RL and presents related work. Section 3 describes the proposed system in detail. Section 4 presents our approach to the problem of short company names. The performance of the proposed system is discussed in Section 5, and Section 6 presents future research directions and concludes the paper.

2 Background

Various systems to perform record linkage have been proposed over the last decades [16, 22, 26, 6, 23]. As mentioned in the introduction, they can usually be divided into rule-based and machine-learning-based systems. Here we point out a couple that are most related and interesting to our current and future work. Konda et al. [22] have proposed a system to perform record linkage on a variety of entities, providing great flexibility in defining the linkage workflow. This system allows the user to select the algorithms being used at the various stages of the linkage process. Despite this flexibility, the approach does not address the performance problem which is at the center of the class of applications we are addressing. The Certus system, proposed in [23], exploits graph differential dependencies for the task of RL. Even though there is no need for an expert to manually create these graphs, a substantial amount of training data is still needed to learn the graphs automatically. We cannot apply such techniques as we consider cases where the amount of training data is very limited.

2.1 Locality Sensitive Hashing Methods

In the domain of RL, Locality Sensitive Hashing (LSH) methods are generally used to provide entities with signatures in such a way that similar entities have identical signatures with high probability [32]. These signatures are commonly referred to as blocking keys which denote blocks. Blocks are used to limit the number of comparisons needed during the scoring phase, where candidate entities are compared in detail. A given query entity is only scored against the set of entities that share the same blocking keys.

It is difficult to correctly identify similar entities based exclusively on a single hash function. Depending on the entity set, a hash function may be too discriminative – or not discriminative enough – to correctly identify similar entities. Therefore, if only one blocking key per entity were generated, a fraction of similar entities might not map to the same blocking key. Consequently, potential matches would be classified as false negatives. To alleviate this problem, LSH uses a set of hash functions to compute the associated blocking keys. Classification into blocks is then done not with a single blocking key but with a set of blocking keys.

Originally, LSH algorithms were developed to compare documents, that is, sets of a significant number of words. In our application, we consider entity names consisting of only a few words (e.g., a company name) and, therefore, we need to artificially generate a “document”. To this end we decompose the entity name into n-grams to obtain the “words” of the “document”. In the remainder of this section we use the terms elements and set. LSH aims at maximizing the probability that two similar sets share common hash keys. Two commonly used algorithms for this purpose are MinHash and SimHash.

MinHash is an algorithm for set similarity based on the Jaccard measure [8, 7]. A common MinHash setup uses a chosen number $k$ of random hash functions $h_1, \dots, h_k$. To generate the MinHashes for a given set $S$ – i.e., its set of hash keys – we apply each hash function $h_i$ to every element of the set and retain only the minimal value $m_i = \min_{x \in S} h_i(x)$. The MinHashes are then the ordered list of the minimal values $(m_1, \dots, m_k)$.

To efficiently compare the MinHashes of two sets, a decomposition into bands and rows is typically applied [21]. To this end the list of $k$ MinHashes is decomposed into $b$ bands, $k$ being a multiple of $b$. Each band contains $r = k/b$ rows. The elements of each band are usually hashed into a dense representation.
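To make the signature and band computation concrete, the following is a minimal Python sketch. The production system described later is written in C++ and uses MurmurHash3; here a seeded SHA-1 digest merely stands in for a family of random hash functions.

    import hashlib

    def h(element: str, seed: int) -> int:
        """One member of the hash family: a seeded 64-bit digest value."""
        digest = hashlib.sha1(f"{seed}:{element}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little")

    def minhash_signature(elements: set, k: int) -> list:
        """For each of the k hash functions, retain the minimal value over the set."""
        return [min(h(e, seed) for e in elements) for seed in range(k)]

    def band_keys(signature: list, b: int) -> list:
        """Decompose the k MinHashes into b bands of r = k/b rows and hash each
        band into a single blocking key; two entities land in a common block
        if any of their band keys collide."""
        r = len(signature) // b
        return [hash(tuple(signature[i * r:(i + 1) * r])) for i in range(b)]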

A fundamental property of MinHashes is that the probability that two sets agree on the MinHash of a single hash function equals the Jaccard similarity of those sets; hence the fraction of agreeing MinHashes is an unbiased estimator of the Jaccard similarity.

It can be shown that the probability that two sets with Jaccard similarity $s$ share at least one band – and are thus mapped to a common block – is $1 - (1 - s^r)^b$. Plotted as a function of $s$, this is commonly referred to as the “S-curve” in the MinHash literature. This measure allows us to quantitatively describe the characteristics of a MinHash configuration [21].
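As an illustration, a few lines of Python reproduce the S-curve values used to compare configurations in Section 5; the function below is simply the formula above.

    def match_probability(s: float, r: int, b: int) -> float:
        """Probability that two sets with Jaccard similarity s share a band."""
        return 1.0 - (1.0 - s ** r) ** b

    print(match_probability(0.6, r=4, b=10))  # ~0.75
    print(match_probability(0.8, r=4, b=10))  # ~0.99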

Over the last decade, a number of efforts have been undertaken to improve the performance of MinHash in terms of data size [25] and in terms of speed [25, 12, 33, 35].

SimHash is another well-known LSH algorithm, based on cosine similarity. A SimHash for a given set is obtained as a superposition of weighted hash codes of each element in the set. The resulting SimHash is represented as a bit sequence of a certain length defined by the chosen hashing algorithm [9, 28]. The collision probability of two SimHash signatures is a function of the angle between the vector representations of their bit sequences [34]. SimHashes are typically compared using the Hamming distance measure. There are efficient solutions applicable to SimHash for the challenge of quickly finding all signatures that differ from a given signature in at most $d$ bit positions, where $d$ is a small integer [17].

Both MinHash and SimHash generate large numbers of hash keys and, to obtain good lookup performance, they rely on efficient in-memory storage. Each approach has distinct advantages. MinHash (i) allows for more variability between the feature sets, (ii) has quality guarantees that can be explicitly computed and (iii) therefore allows parameter values to be determined easily. Furthermore, the algorithm is in the public domain. In comparison, SimHash provides more flexibility to weight individual features in the set and has a more compact representation of the entity signatures because fewer hashes might be needed [28].

In our current setup, we chose MinHash in order to leverage the explicit computation of the parameters. Moreover, it can be tuned for high recall (i.e., to maximize the number of relevant entities retrieved), which is a prime requirement for RL.

2.2 Machine Learning for Record Linkage

A first approach using machine learning techniques for record linkage was proposed in 2003 by Elfeky et al. [11]. In this work, a trained classifier is compared to unsupervised clustering and to a probabilistic approach. Although the trained classifier outperforms the other approaches, the authors mention the difficulty of obtaining training data. More recently, studies have been conducted to assess the applicability of neural networks to record linkage, for example [15, 29]. In particular, Mudgal et al. [29] show that, compared to “classical” approaches, deep learning brings significant advantages for unstructured and noisy data while it only achieves marginal improvements for structured data.

The major limitation for the use of machine learning techniques in record linkage is the difficulty of finding sufficient annotated training data. This is especially true for company names. Moreover, for each new reference dataset introduced in the system, a new specific training dataset has to be developed. To alleviate this problem, some promising approaches like the use of active learning [30] have been proposed. However, the application of machine learning techniques to record linkage remains limited at the moment. In this work, we apply machine learning to RL through the step of short name extraction from a conventional company name (Section 4).

3 Record Linkage System

As discussed in the introduction, we consider the problem of RL done on the fly, i.e., dynamically linking an incoming record to records in one or more reference datasets. A record is defined as a collection of attributes, each of them corresponding to a column in the dataset. Typical attributes are company name, street address, city, postal code, country code, industry, etc. Note that different reference datasets might not contain the same attribute types and/or attributes might be referenced by different names. For the latter, we assume that the attribute names are normalized.

We accomplish the linkage by preprocessing the reference datasets using an offline preprocessing pipeline. Once the datasets have been preprocessed they are used by our runtime pipeline which is responsible for matching incoming records against candidate records and returning the best matches. These two pipelines are shown in Figure 1.

Figure 1: Preprocessing and runtime pipeline.

3.1 Preprocessing Pipeline

The preprocessing pipeline reads records from a given source format, “cleans” them, and generates a binary database that supports the efficient retrieval of the records. Once the binary database has been generated, a blocking key database is built by “cleaning” each record again and computing for each record a set of blocking key values corresponding to an LSH function. As discussed in Section 2, our implementation uses MinHash [7, 8] as the LSH function.

The generation of a MinHash for an entity feature value essentially encompasses the following steps: (i) cleaning, (ii) shingling into a set of n-grams, (iii) retrieval of n-gram vocabulary indices, (iv) computation of MinHashes for each chosen random hash function, (v) grouping MinHashes into bands, (vi) hashing band MinHashes, and (vii) adding band MinHashes to the blocking key database. The band MinHashes are hashed using a general-purpose hash function with a close-to-uniform output distribution. In our setup, we used the 64-bit MurmurHash3 [18].

The blocking key database stores for each blocking key the corresponding record indices. Our method provides two kinds of cleaning operations. A “light” cleaning operation is applied to attributes that are used to improve the scoring performance of candidate records. These attributes are stored in both the original unmodified version and the cleaned version in the binary record database. This cleaning operation converts accented characters into their decomposed form in the Unicode representation [2], maps characters to their lowercase equivalent, replaces punctuation with spaces, and collapses multiple consecutive spaces into a single space.
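A minimal Python sketch of this “light” cleaning, using the standard unicodedata module; our actual implementation is part of the C++ pipeline, so the behavior below is only an approximation.

    import re
    import unicodedata

    def light_clean(text: str) -> str:
        """Decompose accented characters (Unicode NFD), lowercase,
        replace punctuation with spaces, collapse multiple spaces."""
        text = unicodedata.normalize("NFD", text).lower()
        text = "".join(" " if unicodedata.category(c).startswith("P") else c
                       for c in text)
        return re.sub(r"\s+", " ", text).strip()

    print(light_clean("Dürr, Inc."))  # "dürr inc" (combining mark retained)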

In earlier versions of our RL system we applied heavier data cleaning, such as removing accented characters and also the legal entity type of a company such as “inc” and “ltd”, since this information is not always present in a company name. However, we quickly identified cases where this type of cleaning created ambiguities that could no longer be resolved. This observation has also been reported in [31]. Hence, rather than eliminating this information, we score it with an appropriately lower weight as discussed in Section 3.2.

Once the binary representation of the record database has been generated, we compute an additional blocking key database. This database serves as a similarity index for a given attribute in the record, frequently the name of the entity represented by the record. This blocking key database is generated by computing the LSH values for the company name of each record. Then for each LSH value all the record indices, for which the company name attribute maps to the same LSH value, are stored. We also refer to the LSH value as blocking key.

For the computation of the LSH values, attributes are cleaned again and converted into Jaccard sets using bigrams. The idea of this cleaning step is to normalize commonly used notational variations. For instance, one of our databases contains an entry for “téléski” whereas another database contains an entry for “teleski”, both referring to the same entity. The following table shows the corresponding Jaccard sets.

Name     Jaccard set
téléski  té, él, lé, és, sk, ki
teleski  te, el, le, es, sk, ki

Except for sk and ki, all bigrams are different, giving a rather low Jaccard similarity of 2/10 = 0.2; hence the probability that these two representations are mapped onto the same block is low. Therefore, we apply the following additional cleaning operations:

Remove diacritics:

We remove all diacritic marks such as accents, umlauts, etc., as motivated by the above example.

Remove legal entity types:

We remove the legal entity type of a company. In some databases this information is omitted, which would make it harder to match the shortened company name unless the legal entity type is removed.

Merge single characters:

Acronyms are sometimes combined into a single word, sometimes separated by dots, and sometimes separated by dots and spaces. This step ensures that bigrams are consistently formed for these acronyms.

Merge numbers:

Sometimes numbers are separated using spaces to make them more readable. This operation ensures a consistent number representation.

These cleaning operations ensure that records with notational variations are assigned the same blocking key by the LSH function. Of course, this will generate a number of incorrect matches that will have to be removed by our subsequent scoring algorithm.

3.2 Runtime Pipeline

The runtime pipeline links queries to the entities stored in the entity database. It computes the blocking keys and retrieves the corresponding candidate entities. It also transforms the query into a more efficient representation in the form of a tree structure called the scoring tree. The scoring tree is evaluated against the candidate entities and the matching result records are sorted. Depending on the request, the top-$k$ matching records or all records with a score higher than a given threshold are returned as matches.

The scoring tree uses different scoring algorithms depending on the type of data to be processed. If the data describes an address, we use a geographic scoring, whereas if it describes a company name, a scoring algorithm tuned for company names is used. If multiple types of data are present, the scoring tree combines the scores into a single value.

More formally, the goal of the scoring function is to evaluate the similarity between a query record $q$ and a record $r$ in a reference dataset such that $score(q, r) \in [0, 1]$. If $score(q, r) = 1$ then $q$ and $r$ are identical.

The query record contains information typically provided by an application. For example, a named entity recognition (NER) component extracts entities from some unstructured text. By the nature of such a process, the information extracted may vary considerably. Certain attributes might not be recognized while some attributes might be present multiple times. For example, the industry or the street address might not be recognized but several cities or countries might be present. Therefore, the query record must be able to accommodate this variability. To this end, we use a data structure where, except for the company name, all the fields are optional and can have multiple instances. Figure 2 shows the data structure for the query records and the reference dataset records.

Figure 2: Data structures of records.

The purpose of the scoring function is to compute: (i) the scores related to the individual attributes composing the query and reference records, and (ii) the appropriate combination of those scores into a single representative score. The individual characteristics of a record require completely different scoring semantics. Moreover, the combination of individual characteristics follows different rules as well. For example, when combining scores of company name, address, and industry, a weighted sum of the scores is an appropriate approach, whereas when combining scores of multiple addresses, a maximum over the individual scores is the right approach.

In this paper we focus on three scores $s_{name}$, $s_{geo}$ and $s_{ind}$, with values in $[0, 1]$, representing respectively the company name, geographic location and industry scores. This does not limit the generality of our approach as this list can easily be extended. The score can now be expressed as a weighted sum of these scores:

$score(q, r) = w_{name} \cdot s_{name} + w_{geo} \cdot s_{geo} + w_{ind} \cdot s_{ind}$

where $w_{name} + w_{geo} + w_{ind} = 1$. RL must accommodate variable inputs. Therefore, if an attribute is not present, we assume that the corresponding score is 0 and its weight is redistributed among the weights of the other attributes.

3.2.1 Scoring Company Names

In order to score company names, we started out with different string similarity functions, such as the Jaccard similarity that MinHash is based on, or the Levenshtein distance $lev(x, y)$. However, as we need a score valued in $[0, 1]$ for two strings $x$ and $y$ representing company names of length $m$ and $n$, we use the following formulations:

$s_{lev}(x, y) = 1 - \frac{lev(x, y)}{\max(m, n)}, \qquad s_{jac}(x, y) = \frac{|B(x) \cap B(y)|}{|B(x) \cup B(y)|}$

where $B(\cdot)$ denotes the bigram set of a string. We name these the Levenshtein and Jaccard scores, respectively.
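A Python sketch of the two base scores, assuming the formulations reconstructed above (bigrams over the cleaned name, edit distance normalized by the longer string):

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def bigrams(s: str) -> set:
        return {s[i:i + 2] for i in range(len(s) - 1)}

    def score_lev(x: str, y: str) -> float:
        return 1.0 - levenshtein(x, y) / max(len(x), len(y))

    def score_jac(x: str, y: str) -> float:
        bx, by = bigrams(x), bigrams(y)
        union = bx | by
        return 1.0 if not union else len(bx & by) / len(union)

    print(score_lev("dürr", "durr"))  # 0.75
    print(score_jac("dürr", "durr"))  # 0.2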

A first limitation we observed is that the Jaccard and Levenshtein scores give too much weight to diacritics. As an example, let us consider a German company name where ü can be written as ue. The following table presents the scores with and without the corresponding umlaut:

                 Levenshtein  Jaccard
Dürr vs. Durr    0.75         0.20
Dürr vs. Duerr   0.60         0.17

A naïve approach is to simply remove all diacritics as part of the cleaning step. However, there are company names that are only differentiated by the presence of diacritics. For instance, again from the German language, Wächter, Wachter and Waechter all represent existing but different companies. To tackle this problem, we leverage a property of the Unicode representation where diacritics are represented as special combining characters. The combining characters are given a lower weight in the scoring process.

Another challenge is to deal with legal entity types of companies such as “inc.” or “ltd.”, which may or may not be included in the company name. In our initial attempt we simply removed these legal entity type identifiers. However, we soon came across companies whose names differ only by the legal entity type but are actually distinct companies. This is one of several occurrences where cleaning had a negative effect on scoring, and it confirms the observations made by Randall et al. [31]. Generally, one approach to alleviate the problem related to special mentions (e.g., legal entity types) is to assign them a reduced weight in the scoring process. Therefore we adopted the approach of assigning legal entity types the same weight as a single character minus a small value ε. (The choice of the actual “small” value is driven by the Levenshtein function implementation.) To understand this approach, let us consider three distinct companies Garage Rex AG, Garage Rex GmbH and Garage Rey AG. The table hereafter shows the Levenshtein score and its modification considering legal entity types:

                                   Levenshtein  Levenshtein modified
Garage Rex AG vs. Garage Rex GmbH  0.73         1 − (1 − ε)/12
Garage Rex AG vs. Garage Rey AG    0.92         1 − 1/12

As we can see, subtracting ε allows us to distinguish the case where the changes are not confined to the legal entity type.

The words in company names can be permuted. For instance, IBM Zurich Research Lab and IBM Research Zurich denote the same company. This case is covered by the fact that the Jaccard similarity handles permutations.

In some situations the city name can be included in the company name. For example, IBM Research Zurich is sometimes written as IBM Research if it is clear from the context that the geographic region is Switzerland. To handle this situation, we detect city name mentions in a company name and reduce their weight if the city is in the company’s vicinity. This allows for more flexibility in the names. To look up city names we use a fast trie described in the next section.

Additionally, we derive for each company name a short company name, as we will show in Section 4. Words that are part of the short name are weighted higher, that is, three times the normal weight. This approach gives more emphasis to the characteristic words of the company compared to the other elements present in the name.

Finally, the company name score is computed as:

$s_{name} = 0.9 \cdot \max(s_{lev}, s_{jac}) + 0.1 \cdot \min(s_{lev}, s_{jac})$

where $s_{lev}$ and $s_{jac}$ are respectively the Levenshtein and Jaccard scores, modified with the considerations above and applied to the company names in $q$ and $r$. The rationale behind this choice is that the Jaccard score allows for word permutations while the Levenshtein score relies on the character sequence; this fact is not well represented by a weighted average. The reason we do not simply use the maximum of the two similarities is that the Jaccard similarity may return, in certain cases, a similarity of 1 for names that are different. The minimum term ensures that a match with a Jaccard similarity of 1 is not by coincidence chosen over a Levenshtein similarity of 1, which is only possible if the strings are equal. The values of 0.9 and 0.1 have been chosen arbitrarily.
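Assuming the max-min combination reconstructed above, and reusing score_lev and score_jac from the previous sketch, the composite name score can be written as follows; with a permuted name the Jaccard score stays high and dominates the combined score.

    def score_name(x: str, y: str) -> float:
        """0.9 * max + 0.1 * min of the Levenshtein and Jaccard scores."""
        lev, jac = score_lev(x, y), score_jac(x, y)
        return 0.9 * max(lev, jac) + 0.1 * min(lev, jac)

    # Word permutation: Jaccard stays high while Levenshtein drops, so the
    # combined score stays close to the Jaccard score.
    print(score_name("ibm zurich research lab", "ibm research zurich"))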

3.2.2 Scoring Geographical Locations

A geographical location is described by means of an address element. This element contains the street address, postal code, city and country code attributes. Each component is scored using a specific algorithm. To compute the geographical location score $s_{geo}$ we again use a weighted sum of the scores of the individual attributes, each score being valued in $[0, 1]$.

The street address is currently scored using tokenized string matching (e.g., tokenized Levenshtein distance [27]). This provides a reasonable measure between street address strings, especially if street number and street name appear in different orders. However, this scoring could be improved by using a geographic location lookup service.

Postal codes are evaluated according to the number of matching digits or characters. Starting from the leftmost position, we identify the longest sequence of $m$ matching digits or characters in the postal codes of length $n$ (if the postal codes to be compared are of different lengths, we take for $n$ the length of the longest postal code) and then apply a logarithmic function:

$s_{postal} = \frac{\log(1 + m)}{\log(1 + n)}$

The rationale behind this approach is that, to the best of our knowledge, the vast majority of postal code systems are organized in a hierarchical fashion.

If the GPS location is available in the reference dataset, the city is scored using the Haversine distance [20]. To retrieve the GPS location of the city mentioned in the query record, we use a trie data structure which contains the names and GPS positions of ca. 195’000 cities worldwide obtained from geonames.org [3]. To evaluate the score $s_{city}$, we compute the Haversine distance $d$ between the cities (in km) and apply an exponential decay $s_{city} = e^{-\lambda d}$ with a manually chosen decay constant $\lambda$.

As a fallback, if the GPS position is not available or the city in the query record cannot be found in the trie, we use the Levenshtein score (described in Section 3.2.1) between the city names.
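A sketch of the city score under these definitions; the decay constant below is an arbitrary illustration value, not the tuned production setting.

    from math import radians, sin, cos, asin, sqrt, exp

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two GPS points in kilometers."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * asin(sqrt(a))

    def score_city(gps_q, gps_r, decay_km=50.0):
        """Exponential decay with distance; decay_km = 50 is illustrative."""
        return exp(-haversine_km(*gps_q, *gps_r) / decay_km)

    # Zurich vs. Geneva (~220 km apart) scores low; identical cities score 1.
    print(score_city((47.37, 8.54), (46.20, 6.14)))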

Finally, the country code score is a simple comparison: $s_{country} = 1$ if the country codes match and $s_{country} = 0$ otherwise.

To provide the final score for the location, we compute a weighted sum across all the address element scores, using manually chosen weights.

3.2.3 Scoring Industries

Typically, industries are represented by four-digit Standard Industry Classification (SIC) codes [4]. Similarly to postal codes, SIC industry codes are hierarchical: the two leftmost digits represent a “Major Group” (e.g., Mining, Manufacturing and others), the following digit is the “Industrial Group” and, finally, the last digit is the specific industry within the industrial group. When representing an industry, codes of variable length can be used depending on the level of generality of the representation. To evaluate the industry score $s_{ind}$, we use a measure similar to the one used for postal codes:

$s_{ind} = \frac{\log(1 + m)}{\log(1 + l)}$

where $l$ is the minimum length of the two SIC codes to be compared and $m$ is the number of matching digits.
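Both hierarchical scores can be sketched with one helper. The logarithmic form is the reconstruction given above, with $n$ taken as the longer length for postal codes and the shorter length for SIC codes:

    from math import log

    def prefix_log_score(a: str, b: str, n: int) -> float:
        """Count matching leading characters m and score log(1+m)/log(1+n)."""
        m = 0
        while m < min(len(a), len(b)) and a[m] == b[m]:
            m += 1
        return log(1 + m) / log(1 + n)

    def score_postal(a: str, b: str) -> float:
        return prefix_log_score(a, b, max(len(a), len(b)))

    def score_sic(a: str, b: str) -> float:
        return prefix_log_score(a, b, min(len(a), len(b)))

    print(score_postal("8803", "8802"))  # 3 of 4 digits match -> ~0.86
    print(score_sic("2834", "28"))       # major group matches fully -> 1.0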

3.2.4 Computing and Combining Scores

Figure 3: Scoring tree example.

As already mentioned, the combination of multiple scores can be done in various fashions. We have developed a tree-based technique to perform the scoring and combine the results. In this approach, leaves correspond to the scoring of individual query attributes while the nodes represent combining functions. A depth-first traversal of the tree performs all the scoring and combining operations in order to obtain the final score. If there are multiple attributes of the same type, these leaves are attached to a parent node representing the combining function for the attribute scores. Figure 3 shows an example of such a scoring tree. In this example, the two address elements are scored according to the approach described in Section 3.2.2 and the parent node computes the max of the computed scores. The same approach is used for the industry scores. Finally, the root node combines the contained scores with a weighted sum.
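A minimal sketch of such a scoring tree; the class and function names are ours, for illustration only.

    class Leaf:
        """Scores a single query attribute against the candidate record."""
        def __init__(self, score_fn, query_value):
            self.score_fn, self.query_value = score_fn, query_value
        def evaluate(self, record):
            return self.score_fn(self.query_value, record)

    class Node:
        """Combines the scores of its children (weighted sum, max, ...)."""
        def __init__(self, combine, children, weights=None):
            self.combine, self.children, self.weights = combine, children, weights
        def evaluate(self, record):
            return self.combine([c.evaluate(record) for c in self.children],
                                self.weights)

    def weighted_sum(scores, weights):
        return sum(s * w for s, w in zip(scores, weights))

    def maximum(scores, _weights):
        return max(scores)

    # Two query addresses combined with max, then a weighted sum at the root
    # (score_address, addr1/addr2 and the weights are illustrative placeholders):
    # root = Node(weighted_sum,
    #             [Leaf(score_name, "ibm research"),
    #              Node(maximum, [Leaf(score_address, addr1),
    #                             Leaf(score_address, addr2)])],
    #             weights=[0.6, 0.4])
    # final = root.evaluate(candidate_record)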

The overall score computation is done using a weighted sum of the company name, geographical location and industry scores, with manually chosen weights.

The rationale behind these values is that we want to privilege the company name over the geographical location and the industry. Similarly, we want to give the location precedence over the industry. Currently all the weights in the scoring process are set manually and correspond to the human intuition of attribute importance. In the future, when sufficient ground truth data is available, we plan to learn these parameters automatically by optimizing the RL accuracy over the ground truth dataset.

3.3 Implementation

The design and implementation of our RL system is driven by three main goals: versatility, speed and scalability.

Versatility is given by the generality of the approach. As shown in the previous sections, the various components have been designed to accommodate virtually any reference dataset and to perform RL on a large variety of entities. The set of scoring functions can be extended to other attribute types, e.g., product names, person names, and others. Also, the scoring tree can be adapted to accommodate these new attribute types with appropriate combining functions. The central element of the system is a generic “linker” which can be easily configured to load a preprocessed dataset and perform linkage. To maximize performance in terms of speed, the linker has been written in C++ and loads the entity database into memory. Therefore, once the linker is started and initialized, all operations are performed in memory. The linker also uses a multi-threaded approach such that asynchronous RL requests can be processed in parallel and make the best use of the cores available on the physical system. Each reference dataset, and therefore each associated entity database, is loaded into a dedicated linker.

To ensure scalability we have adopted a containerized approach; each linker runs in an individual container. In conjunction with a container orchestration system, such as Kubernetes, it is possible to run and dispatch linkers on multiple physical machines. This approach provides linear scaling with the number of nodes that are added to the cluster as well as the ability to run linkages against multiple datasets simultaneously. Moreover, the overall system is resilient to node failures, which is an important characteristic for an enterprise-grade application.

4 Short company names

Full company names usually contain many accompanying words, e.g., “Systems, Inc.” in “Cisco Systems, Inc.”, that carry additional information about a company’s organizational entity type, its location, line of business, size and share in the international market. The accompanying words often vary greatly from one data source to another. For example, some systems will have just “Cisco” instead of the conventional name “Cisco Systems, Inc.”. Such variations are particularly common in unstructured data sources, such as media publications, and in financial reports, where many company sites are aggregated.

Short company names (sometimes also called colloquial or normalized company names) represent the most discriminative substring of a company name string. In many cases, when a query company name is very generic, such as “Cisco”, there might be a large set of valid correct matches in a reference database. In order to retrieve all the matches correctly, it is important to extract and compare the corresponding short company names. Short names also allow us to find matching candidates when company names exhibit certain variability, in both one-to-one and one-to-many lookup cases.

Loster et al. [26] already showed that taking short (colloquial) company names into account greatly benefits company record linkage. However, the company entity matching system described in [26] used a manually created short company name corpus, while in this work we focus on automated short name extraction. In our deployment, the availability of short company names helps both the efficiency and the accuracy of the RL system: short names lead to smaller and more descriptive blocks on the one hand, and help to give more attention or weight to the most discriminative part of a company name on the other.

4.1 Corpus Building

As a corpus for short name extraction we use company data from DBpedia [5] and from a proprietary data source describing company hierarchies. The proprietary data come from Dun & Bradstreet, Inc. [10], where a unique DUNS number is assigned to each company business location.

DBpedia.

DBpedia contains around 65K company entities derived from the English version of Wikipedia. The company entities contain a name, a label and a homepage of a company. We use all of these fields to derive a company short name, as in most cases it is contained either in the label or in the homepage of a company. For example, a company named ‘Aston Martin Lagonda Limited’ has the label ‘Aston Martin’; in this and similar cases, based on a handful of heuristically devised rules, we conclude that ‘Aston Martin’ is the short name of the company. Similarly, a company with the name ‘Cessna Aircraft Company’ has the homepage http://www.cessna.com/; thus, ‘cessna’ is used as a short name for training. After deriving the short name of a company from its name and homepage, we do additional cleaning to exclude the company entity type, such as ‘inc.’, ‘corp.’ and others. We use the list of business entity types by country provided by Wikipedia [38]. In order to augment the ground truth data with more values, we use two kinds of transformations: from a name to a short name and from a label to a short name. Thus, we make sure that the system will be able to correctly extract short names not only from the long official name of a company but also from its shorter, sometimes trivial, versions that are often used in free-text sources, i.e., news articles.

We analyze the number of words in the ground truth data for both the long and short versions of the DBpedia company names. The distribution of the number of words can be seen in Figure 4(a). The long company names mostly consist of two or three words; the short names are mostly covered by one or two words.

(a) DBpedia corpus.
(b) DUNS corpus.
Figure 4: Histograms of the number of words in the long and short versions of a company name.

DUNS Company Data. Another source of training data in our deployment is the company data provided by Dun & Bradstreet, Inc. This data contains company entities, such as branches, subsidiaries and headquarters, all having an individual DUNS number. The set of all DUNS numbers associated with a company can be represented hierarchically. Based on the hierarchies we identify the families of companies that are placed in a single DUNS tree. For each family of companies we extract the common tokens of the company names as a short name for the whole family. After the extraction of common tokens, similarly to the previous case, additional checks are performed in order to exclude legal entity types of companies from the token list. The remaining tokens are combined and used as a short name for all the company names in the family, as sketched below. For example, from a family of companies with the two distinct names ‘ZUMU HOLDINGS PTY LTD’ and ‘ZUMU FOODS PTY LTD’ we extracted ‘ZUMU’ as the representative short name. From this data source we were able to extract 950K long–short name pairs for training. In total, more than a million pairs of long and short company names were used as a corpus for the automatic extraction of company short names.
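A sketch of this family-wide extraction; the legal entity type list here is a tiny excerpt of the Wikipedia-derived list mentioned above.

    LEGAL_TYPES = {"ltd", "pty", "gmbh", "ag", "inc", "corp"}  # small excerpt

    def family_short_name(names):
        """Keep the tokens shared by every name in a DUNS family,
        minus legal entity types, preserving the original order."""
        token_sets = [set(n.lower().split()) for n in names]
        common = set.intersection(*token_sets) - LEGAL_TYPES
        return " ".join(t for t in names[0].lower().split() if t in common)

    print(family_short_name(["ZUMU HOLDINGS PTY LTD",
                             "ZUMU FOODS PTY LTD"]))  # "zumu"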

(a) DBpedia corpus.
(b) Aggregated DBpedia and DUNS corpus.
Figure 5: CRF performance for company short name extraction.

In the case of the DUNS company data, the distribution of the number of words within the short and long names is quite different from the DBpedia case (Figure 4(b)). Long names tend to contain more words (3–5 words occupy a large fraction of the probability mass), and short names consist of two words much more frequently than in the DBpedia case. In general, the task of short name extraction is more difficult for the DUNS data: within a family of companies the variability of names is higher, and often the short name is the most discriminative part of the name while other quite discriminative words must still be left out. For example, for the family of companies:

  • ‘SUNSELEX Verwaltungs GmbH’,

  • ‘SUNSELEX GmbH solar resources’,

  • ‘SUNSELEX AG’,

  • ‘SUNSELEX GmbH solar general constructor’

only ‘SUNSELEX’ should be extracted for each of the family members, even though ‘solar resources’ and ‘solar general constructor’ are still quite discriminative words for the corresponding company names. On the other hand, there are cases when quite a ‘long’ short company name should be detected, for example ‘Yunhe County Jincheng’ from the family ‘Yunhe County Jincheng Arts & Crafts Gifts Factory’ and ‘Yunhe County Jincheng Wood Industry Co., Ltd.’, where the numbers of words that should be kept and skipped in order to obtain the short name are approximately the same. As can be seen from the support pie chart in Figure 5(b), for the overall corpus, where the DUNS portion is dominant, the number of words that should be excluded is indeed slightly higher than the number of words that should be kept in a short name.

4.2 Short name modeling

We treat short name learning as a sequence labeling task, where for each word in a sequence we need to decide whether the word stays in or goes from the company name. Conditional Random Fields (CRF) [24] are among the best performing models for sequence labeling [19]. Modifications and add-ons of CRF are currently gaining popularity for complex labeling tasks, such as bidirectional LSTM-CRF models [19] for part-of-speech (POS) tagging, chunking and Named Entity Recognition (NER). As in our case there are only two labels – ‘IN’, meaning that the word is included in the short name, and ‘OUT’ when a word is excluded – we use a plain CRF classifier, so that only a limited number of parameters has to be trained.

CRF takes into account the neighborhood of a word to decide about its label, as well as a predefined set of features for the word and its neighbors. In addition to the usual features used by CRF for NLP sequence labeling tasks, such as the word itself and checks of its capitalization and postfixes, we include additional features specific to our application:

  1. frequency rank, which corresponds to the order of the words in a company name according to their overall frequencies in the training corpus;

  2. normalized frequency or the relative frequency of a word compared to other words in a company name;

  3. absolute frequency.

The feature choice stands on the following grounds. Many company names contain unique words that constitute the short company name. Our hypothesis is that the frequency of a word in a company name has a significant influence on whether the word is included in the short name. We checked this hypothesis using the DBpedia corpus and, in the large majority of cases, the word with the minimum relative frequency among the words in a company name ends up in the short name of the company. The remaining cases are mainly generic company names with relatively small frequency differences among the words, e.g., ‘American Fitness Association’. We used all available data sources in the company domain to compute the word frequencies. For example, for the DUNS data we used the company name, company trade name, city, country and industry type to compute the overall frequencies.
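A minimal sketch of this setup using the sklearn_crfsuite package (an assumption on our side, not necessarily the implementation used here); the feature set below is a simplified subset of the features listed above, and word_freq is a toy frequency table.

    import sklearn_crfsuite

    def word_features(words, i, freq):
        """Word identity, shape and frequency features for position i."""
        w = words[i]
        ranked = sorted(words, key=lambda t: freq.get(t.lower(), 0))
        feats = {
            "word": w.lower(),
            "is_capitalized": float(w[:1].isupper()),
            "suffix3": w[-3:].lower(),
            "abs_freq": float(freq.get(w.lower(), 0)),
            "freq_rank": float(ranked.index(w)),
        }
        if i > 0:
            feats["prev_word"] = words[i - 1].lower()
        if i < len(words) - 1:
            feats["next_word"] = words[i + 1].lower()
        return feats

    def featurize(name, freq):
        words = name.split()
        return [word_features(words, i, freq) for i in range(len(words))]

    word_freq = {"cisco": 12, "systems": 41000, "inc": 520000}  # toy counts
    X = [featurize("Cisco Systems Inc", word_freq)]
    y = [["IN", "OUT", "OUT"]]  # short name: "Cisco"
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))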

Model Evaluation. To evaluate the CRF for the task of short name extraction, precision, recall and F1-score are computed separately for the ‘IN’ and ‘OUT’ classes. We also present micro and macro averages for each performance measure. The plots for the DBpedia corpus and for the aggregated DBpedia and DUNS corpus are shown in Figure 5.

The results demonstrate that the CRF is able to distinguish between discriminative and non-discriminative words in a company name, as all the performance measures are higher than 0.76 for all the datasets in consideration. The task for the DBpedia names is easier, and the CRF reaches around 0.9 overall accuracy for both classes. For the larger corpus, the model struggled to reveal all the words that should have been included in the short name, yielding a lower recall for the ‘IN’ class. The other performance measures are close to 0.81.

The model is applied to extract the short names in the main record linkage system presented above, with the results in Section 5.

For our application it is important to keep a subset of the words of the original company name in order to maintain the closest discriminative name representation. Thus, we do not produce abbreviations or other name representations that contain only parts of the words of the initial company name. Abbreviations are going to be considered in future work.

5 Evaluation

To evaluate our RL system, we use two internally available datasets. One has been thoroughly developed by our team through multiple company checks. The other represents manual matching done by financial specialists. It also has high-quality links, but unfortunately in many cases only a one-to-one match is available, while additional checking shows that a one-to-many match would better describe reality.

The first ground truth dataset (dubbed Set A) is derived from two important databases containing company information: the “Dun & Bradstreet” [10] (DUNS) and the “Capital IQ” [36] (CapIQ) databases. The DUNS database is maintained by The Dun & Bradstreet Corporation and contains more than 300 million business records divided into active and inactive entries. It is one of the most used marketing intelligence databases and, according to Dun & Bradstreet, it is used by 90% of the Fortune 500 companies. The CapIQ database is maintained by the research division of Standard and Poor’s and provides important financial data about companies.

We built our ground truth by randomly selecting 450 companies, with locations in Switzerland, from CapIQ and matching them manually against DUNS. Our familiarity with the region was instrumental in correctly identifying records referring to the same company. Moreover, the language diversity in Switzerland allows us to assess the system in combination with different languages. Records were divided into the following categories:

Matched

In this case all possibly correct matches are listed, for instance if CapIQ was missing the address, or listed many subsidiaries while the DUNS database only listed the headquarters. (296 records, 196 unique records)

Unmatched

If no corresponding company was present in the DUNS database. (114 records)

Undecided

In cases where we were unable to conclusively decide whether the companies are the same, or where one of the companies was renamed, e.g., due to a merger. These records were counted neither as true positives nor as negatives, but only as false positives if they were matched against a different record. (80 records)

We assess the possible error of the overall accuracy computation to be around 3.3%. This results from applying the confidence interval computation [37] under the assumption of a normal distribution of the accuracy measures:

$\Delta = 1.96 \cdot \sqrt{\frac{p(1-p)}{n}}$

where $n = 310$ is the size of our ground truth set, not counting undecided records and only counting unique matches (196 matched plus 114 unmatched), and $p \approx 0.9$ is our recall, as we will demonstrate below.

The other ground truth dataset is based on an internal database that matches internal company data against the DUNS database, for which we can assume that the records are accurate. In contrast to the first ground truth set, it contains global data matches. However, this database only lists a single corresponding DUNS record and not all possibly correct matches.

5.1 Row-Band Configuration

In this section, we evaluate the performance of our system with different MinHash row-band configurations using our ground truth Set A. We identified that correct matches typically have a Jaccard similarity greater than 0.8; however, some correct matches have a similarity as low as 0.6. Using these numbers we chose three row-band configurations such that matching records with a Jaccard similarity of 0.8 are matched with a probability above 99% and those with a similarity of 0.6 with a probability of about 75%. Considering that correct matches with a similarity below 0.6 are outliers in our ground truth set, the 75% figure is a trade-off between performance and matching accuracy. These row-band configurations are shown in Figure 6 and Table 1.

Figure 6: S-curves (left: full; right: zoomed; x-axis: Jaccard similarity; y-axis: matching probability)

Jaccard similarity  4/10   5/18   6/30
0.5                 47.5%  43.5%  37.6%
0.6                 75.0%  76.7%  76.1%
0.7                 93.5%  96.3%  97.6%
0.8                 99.4%  99.9%  99.9%
Table 1: MinHash matching probabilities for the rows/bands configurations

These configurations allow us to evaluate the trade-off between the different MinHash configurations. A higher number of rows and bands gives a sharper S-curve (described in Section 2.1); hence, entities with a low score are less likely to be considered a match. However, this comes at the expense of having to compute more MinHashes (numerically, $r \cdot b$, i.e., 40, 90 and 180 respectively) as well as consuming more memory to store the additional bands.

Table 2 shows the results of linking the CapIQ records from our ground truth dataset against DUNS. The recall of each configuration is approximately 87% (Table 2). This is not surprising considering that the S-curve was configured to capture company names with a Jaccard similarity of 0.6 with only a 75% probability. It has to be noted that the memory consumption grows almost linearly with the number of bands.

The number of comparisons necessary for each CapIQ record initially looks surprising, because the S-curves are relatively close to each other. However, considering that the score distribution among candidate entities is heavy-tailed, with considerably more candidate entities having a low Jaccard similarity, this difference is easily explained.

Finally, the time spent computing the MinHashes for each record to be matched was negligible, which is not surprising considering that each record has to be compared to 23 to 72 thousand candidate records.

MinHash 4/10 5/18 6/30
recall 86.67% 87.18% 87.18%
database size 38.8GiB 57.6GiB 99.5GiB
comparisons 72.9k 55.0k 23.6k
Table 2: Memory and Performance Comparison

5.2 Scoring Strategies

In Section 3.2.1 we described our algorithm for scoring company names. Figure 7 compares the precision and recall numbers of this strategy to the following strategies:

Figure 7: Comparison of distance functions for the different MinHash configurations: 4/10 (encircled data point), followed by 5/18 and 6/30 (x-axis: precision [%]; y-axis: recall [%])
Jaccard:

This uses the Jaccard similarity as the score ($s_{name} = s_{jac}$).

Levenshtein:

This uses the Levenshtein similarity as the score ($s_{name} = s_{lev}$).

weighted:

This uses the arithmetic mean of the Jaccard and Levenshtein similarities as the score ($s_{name} = (s_{jac} + s_{lev})/2$).

max-min:

This is the scoring presented in Section 3.2.1, except that the combining character, legal entity type, and city optimizations are not activated ($s_{name} = 0.9 \max(s_{lev}, s_{jac}) + 0.1 \min(s_{lev}, s_{jac})$).

RLS:

This is the full scoring strategy as presented in Section 3.2.1. In addition, tokens that constitute the detected short company names are weighted 3 times higher than all the other name tokens.

The similarity functions are shown for the row-band configurations discussed previously: 4/10 (encircled), followed by 5/18 and 6/30. Again, the results are very similar for the different band configurations, possibly a bit better for those with higher band numbers, which would be supported by the fact that the matching probability for similarities around 0.6, which include most outliers, is slightly higher for higher band numbers.

The Jaccard similarity has a lower recall than the Levenshtein similarity because the former is more sensitive to small changes in the name, as we have seen in the Dürr example previously. As a consequence, its precision is higher. The weighted approach lies in the middle between the two.

The max-min strategy gives results similar to the weighted strategy in terms of recall; however, it achieves lower precision. This can be explained by the case where two company names have a “high” Jaccard score while having a “low” Levenshtein score. For example, if the Jaccard score is 1.0 and the Levenshtein score is 0.4, the arithmetic mean is 0.7, barely above our threshold, whereas max-min gives a score of 0.94, more closely resembling the Jaccard similarity.

The comprehensive RLS approach shows significant improvements in terms of recall. Its precision is similar to that of the max-min strategy but below the weighted or Jaccard strategies. This is due to the fact that it finds matches for records in the ground truth dataset that have no corresponding matches in the DUNS database; in this case it is almost impossible to discern close matches from non-matches. Considering that we favor recall over precision, this is a good trade-off.

5.3 Credit Request Data

The global ground truth dataset Set B consists of 54K records matching internal company data against entries in the DUNS database. This allows us to create test data where the internal data attributes carry the prefix I and the DUNS data attributes the prefix D (e.g., Iname and Dname for the respective company names).

One shortcoming of this dataset is that it only links each internal company record to a single DUNS record; large companies have multiple locations, each with a unique DUNS number, that are equally correct. In order to assess our RL system on the basis of this ground truth dataset, we matched the internal company data, consisting of company name, address and other attributes, against our RL system and obtained the matching results summarized in Table 3.

In the following we use DUNS#GT to refer to the DUNS number in the ground truth set and DUNS#RLS to refer to the same element in the data returned by our RL system. As shown in Table 3, among the 54K global ground truth records, around 25.5K records are linked to the same DUNS #, leaving around 28.6K records linked to a different DUNS #. This gives a recall of 47%. As mentioned before, however, only a single number from the set of correct DUNS # is included in this ground truth set. Hence, we cannot yet conclude that the remaining 53% of the matches are incorrect.

Description                                       Records  Matches (cumulative)
# of all ground truth records                     54’234
DUNS#GT = DUNS#RLS                                25’570   47%
Dname equal, among the remaining 28’664 records   15’442   75.6%
Iname equal, among the remaining 13’222 records    2’597   80.4%
unaccounted records                               10’625
Table 3: Matching analysis of Set B.

Among the 28.6K records, 15.4K are trivial matches where the DUNS name in the GT record is equal (including spacing and punctuation) to the DUNS name in the RLS record. Based on the assumption that two records in the DUNS database with the same company name refer to the same company (i.e., “International Business Machines” is “International Business Machines”), we count these records as correct matches.

Of the remaining 13.2K records, another 2.6K are trivial matches where the internal name Iname is equal (including spacing and punctuation) to the DUNS name in the RLS record, based on the same principle. These two cases are related to the fact that the ground truth database has only one DUNS number per entry.

This leaves around 10.6K unaccounted records. These records may contain both correct and incorrect matches. To quantify these numbers, we randomly chose 204 records for manual verification. Among these records, 88 are correctly and 102 incorrectly matched; 14 companies could not be matched although a match should have been found, and 5 companies could not be matched because the company is no longer present in the DUNS database of active companies, hence these non-matches are correct. This gives a precision of 46% (88 out of 190 matched) and a recall of 86% (88 out of 102 matchable) for these 204 records.

Assuming that these records are representative of the set of unaccounted records, we can extrapolate the overall recall and precision accordingly. The recall is very close to what we have observed with our Swiss ground truth set. The precision figure is much higher because there is only a small number of unmatchable records in the global ground truth set.

6 Conclusions and Future work

In this work we presented a fast RL system for company matching. We showed that the proposed system is able to match 30% of the otherwise unmatched entity records. This improvement is due to two contributions: the introduction of short company name extraction and its use both in the preprocessing phase and in the scoring phase; and specific improvements of the scoring function, namely taking into account diacritic characters and legal entity types, and the ability to identify geographic locations in company names.

Additionally, as deployed today in a cluster with three nodes, we are capable of linking a record every 17 ms, which means we are able to match approximately 5’000’000 records per day. These performance figures scale linearly with the number of nodes in the system. Hence, our system is well suited for analyzing high-volume streamed content.

As future work, we will:

  • take into account company abbreviations;

  • consider the historical evolution of company names;

  • introduce automatic parameter learning and automatic training dataset augmentation;

  • explore the use of other LSH functions such as SimHash [28] to assess whether our recall values can be improved further.

References

  • [1] https://www.ibm.com/watson/services/natural-language-understanding/. Accessed: 2019-05-28.
  • [2] https://www.unicode.org/standard/standard.html. Accessed: 2019-05-28.
  • [3] https://www.geonames.org. Accessed: 2019-05-28.
  • [4] Standard industry classification. https://www.osha.gov/pls/imis/sicsearch.html.
  • [5] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web (ISWC 2007), Springer, 2007.
  • [6] L. Barbosa, V. Crescenzi, X. L. Dong, P. Merialdo, F. Piai, D. Qiu, Y. Shen, and D. Srivastava. Big data integration for product specifications. IEEE Data Eng. Bull., 41(2):71–81, 2018.
  • [7] A. Z. Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997, pages 21–29, June 1997.
  • [8] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 327–336, New York, NY, USA, 1998. ACM.
  • [9] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In 34th STOC, pages 380–388, 2002.
  • [10] Dun & Bradstreet Company. http://www.dnb.com, 2019.
  • [11] M. Elfeky, V. Verykios, A. Elmagarmid, T. Ghanem, and A. Huwait. Record linkage: A machine learning approach, a toolbox, and a digital government web service. Computer Science Technical Reports, Purdue University, (1573), 2003.
  • [12] O. Ertl. Superminhash - a new minwise hashing algorithm for jaccard similarity estimation, 2017.
  • [13] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
  • [14] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice & open challenges. Proceedings of the VLDB Endowment, 5:2018–2019, 08 2012.
  • [15] R. D. Gottapu, C. Dagli, and A. Bahrami. Entity resolution using convolutional neural network. Procedia Computer Science, 95:153–158, 12 2016.
  • [16] L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. CSIRO Mathematical and Information Sciences Technical Report, 3, 06 2003.
  • [17] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In 16th International Conference on World Wide Web, pages 141–150, New York, NY, USA, May 2007. ACM.
  • [18] A. Horvath. Murmurhash3, an ultra fast hash algorithm for c# .net. http://blog.teamleadnet.com/2012/08/murmurhash3-ultra-fast-hash-algorithm.html. Accessed: 2019-05-28.
  • [19] Z. Huang, W. Xu, and K. Yu. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991, 2015.
  • [20] J. Inman. Navigation and Nautical Astronomy: For the Use of British Seamen (3 ed.). London, UK: W. Woodward, C. & J. Rivington, 1835.
  • [21] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, Cambridge, United Kingdom, 2nd edition, 2014.
  • [22] P. Konda, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra, S. Das, P. Suganthan G. C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, and H. Zhang. Magellan: Toward building entity matching management systems over data science stacks. Proceedings of the VLDB Endowment, 9:1581–1584, 09 2016.
  • [23] S. Kwashie, J. Liu, J. Li, L. Liu, M. Stumptner, and L. Yang. Certus: An effective entity resolution approach with graph differential dependencies (gdds). PVLDB, 12(6):653–666, 2019.
  • [24] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 282–289, 2001.
  • [25] P. Li and A. C. König. b-bit minwise hashing. In 19th International Conference on World Wide Web, New York, NY, USA, 2010. ACM.
  • [26] M. Loster, Z. Zuo, F. Naumann, O. Maspfuhl, and D. Thomas. Improving company recognition from unstructured text by using dictionaries. In EDBT 2017, pages 610–619, 2017.
  • [27] K. Mirylenka, P. Scotton, C. Miksovic, and A. Schade. Similarity matching system for record linkage. US Patent Application P201704804US01, 2018.
  • [28] M. S. Charikar (assigned to Google Inc.). Methods and apparatus for estimating similarity. US Patent.
  • [29] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, pages 19–34, New York, NY, USA, 2018. ACM.
  • [30] K. Qian, L. Popa, and P. Sen. Active learning for large-scale entity resolution. In Proceedings of CIKM ’17, pages 1379–1388, 11 2017.
  • [31] S. M. Randall, A. M. Ferrante, J. H. Boyd, and J. B. Semmens. The effect of data cleaning on record linkage quality. BMC Medical Informatics and Decision Making, 13(1), June 2013.
  • [32] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8:321–350, 2012.
  • [33] A. Shrivastava. Optimal densification for fast and accurate minwise hashing, 2017.
  • [34] A. Shrivastava and P. Li. In defense of minhash over simhash. AISTATS, 2014.
  • [35] S. Dahlgaard, M. B. T. Knudsen, and M. Thorup. Fast similarity sketching, 2017.
  • [36] S&P Global Market Intelligence. https://www.capitaliq.com/, 2019.
  • [37] L. Sullivan. http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals_print.html. Accessed: 2019-05-28.
  • [38] Wikipedia contributors. Types of business entity — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Types_of_business_entity, 2019. [Online; accessed May 2019].