NSEEN: Neural Semantic Embedding for Entity Normalization

11/19/2018 · Shobeir Fakhraei et al. · USC Information Sciences Institute

Much of human knowledge is encoded in text, such as scientific publications, books, and the web. Given the rapid growth of these resources, we need automated methods to extract such knowledge into formal, machine-processable structures, such as knowledge graphs. An important task in this process is entity normalization (also called entity grounding or resolution), which consists of mapping entity mentions in text to canonical entities in well-known reference sets. However, entity resolution is a challenging problem, since there are often many textual forms for a canonical entity. The problem is particularly acute in scientific domains such as biology. For example, a protein may have many different names and syntactic variations on those names. To address this problem, we have developed a general, scalable solution based on a deep Siamese neural network model that embeds the semantic information about the entities as well as their syntactic variations. We use these embeddings for fast mapping of new entities to large reference sets, and empirically show the effectiveness of our framework on challenging bio-entity normalization datasets.


Keywords

Deep Learning, Siamese Networks, Entity Grounding, Entity Normalization, Entity Resolution, Entity Disambiguation, Record Linkage, De-duplication, Entity Matching, Data Integration, Similarity Search, Similarity Learning, Metric Learning, Large Scale Reference Set

1 Introduction

Digital publishing has accelerated the rate of textual content generation beyond human-consumption capabilities. Taking scientific literature as an example, Google Scholar indexed about four and a half million articles and books in 2017, a 50% increase over the previous year. Automatically organizing this information into a proper knowledge representation is an important way to make it accessible. This process includes identification of entities in the text, often referred to as Named Entity Recognition (NER), and mapping of the identified entities to existing reference sets, called Entity Grounding, Normalization, Resolution, or De-duplication (although there are subtle differences between these terms, we use them interchangeably in this paper; the formal definition of the task we consider is provided in Section 2). In this paper we focus on providing a neural-based solution for entity normalization to a reference set.

Entity grounding to a reference set is a challenging problem. Even though in some cases normalization can be as simple as a database look-up, often there is no exact match between the recognized entity in the text and the reference entity set. There are two main sources of this variation. The first is syntactic variation, where the identified entity differs from the canonical form in the reference set by relatively small character-level changes, such as different capitalization, reordering of words, typos, or errors introduced in the NER process.

The second and more challenging problem, which we call semantic variations, is when the identified entity does not exist in the reference set, even when considering significant syntactic variations, but a human reader can recognize the non-standard entity name. For example, entities often have multiple canonical names in the reference set and the identified entity name is a combination of parts of different canonical names.

A further challenge is how to perform normalization at scale. Exhaustive pairwise comparison of the identified entity to the reference entities grows quadratically and is infeasible for large datasets. Blocking [1] is a common solution to this problem. Unfortunately, blocking methods applied directly to the textual representation of entity names are often limited to simple techniques that can only address syntactic variations of the names.

In this paper we develop an approach to address entity normalization at scale. Our contributions include: 1) We develop a general, scalable deep neural model to embed entity information that captures both syntactic and semantic variations in a numeric vector space. 2) We provide a method to incorporate domain knowledge about possible entity variations into the embeddings. 3) We develop dynamic hard negative sampling to refine the embeddings for improved performance. 4) By embedding the entities in a numerical vector space, we map the task to a standard k-nearest-neighbors problem and can deploy a scalable representation that enables fast retrieval without the need for traditional blocking. 5) We empirically show the effectiveness of our method in different domains.

2 Problem Definition

Let $\mathcal{R} = \{e_1, \dots, e_m\}$ be a reference set of entities, where each entity $e_i = (d_i, N_i)$ is identified via an ID $d_i$ and a set of names that refer to it, $N_i = \{n_i^1, n_i^2, \dots\}$. Given the name $n_q$ of a query entity, our goal is to retrieve the ID $d_q$ of the corresponding entity in our reference set $\mathcal{R}$. Note that the exact textual name of the query entity may not exist in the reference set.
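For concreteness, such a reference set can be represented as a simple mapping from IDs to name sets. The following toy sketch is illustrative only; the IDs and names are made up for exposition, not drawn from the actual reference sets:

# Toy reference set R: each entity e_i is an ID d_i mapped to its name set N_i.
reference_set = {
    "ID_1": {"FOXP2", "Forkhead box protein P2"},
    "ID_2": {"PLCG2", "Phospholipase C-gamma-2", "PLC-gamma-2"},
}
# A query name n_q; note it has no exact match in the reference set.
query_name = "FOX P2"  # the goal is to retrieve d_q = "ID_1"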

3 Approach

To address this task, we map it to an approximate nearest-neighbor search in an $n$-dimensional space where each name in the reference set is encoded with a numerical vector representation. Our assumption in this embedding space is that names of the same entity (even syntactically very different ones) are closer to each other than to names of other entities. That is, $\mathcal{D}(v_i^a, v_i^b) < \mathcal{D}(v_i^a, v_j^c)$ for all $e_i, e_j \in \mathcal{R}$ s.t. $i \neq j$, where $e_i$ and $e_j$ are entities, $n_i^a, n_i^b \in N_i$ and $n_j^c \in N_j$ their corresponding names, $v_i^a, v_i^b, v_j^c$ the embedding vectors of these names, and $\mathcal{D}$ is a distance function.

We use a Siamese neural network architecture to embed the semantic information about the entities as well as their syntactic similarities. We further refine the similarities via dynamic hard negative sampling and by incorporating prior knowledge about the entities using additional generated training data. We then encode and store the embeddings in a numeric representation that enables fast retrieval of results without the need for traditional character-based blocking. Our approach consists of three steps:

Similarity Learning. We first learn an embedding function that maps the entity names to a numeric vector space where names of the same entities are close to each other.

Embedding and Hashing. Then, we embed all the names in the reference set to the numerical vector space and hash and store the reference set embeddings for fast retrieval.

Retrieval. Finally, we embed the query name (i.e., $n_q$) using the learned model and find the closest samples to it in the embedding space to retrieve the corresponding ID (i.e., $d_q$) of the query name in the reference set.

The following sections describe each step in detail.

3.1 Similarity Learning


In the first step we learn a function $f$ that maps the textual representation of a name, $n$, to a numerical vector representation, $v_n = f(n)$. As stated earlier, this function should preserve the proximity of names that belong to the same entity.

We use a Siamese recurrent neural network model (described in Section 3.1.1) to learn the function $f$. To train the parameters of the model, we generate examples consisting of pairs of names and a similarity score, $(n_1, n_2, y)$, based on names in the reference set, variations of those names, and hard negatives found using the current state of the model.

Figure 1: Learning the embedding function based on the semantics in the reference set and syntactic variations defined by the domain knowledge and hard negative mining.

Figure 1 shows the overall schema of the training process, where training pairs are generated based on the names in the reference set $\mathcal{R}$, syntactic variations, and the iterative hard-negative-mining process.

The similarity learning process is described in Algorithm 1. The following sections describe the details of the Siamese deep neural network architecture and training pair selection and generation process.

1: procedure TrainSim($\mathcal{R}$, $P_s$)
2:     Input: reference set $\mathcal{R}$
3:     Input: pairs $P_s$ based on knowledge of syntactic variation in the domain
4:     Generate pairs based on reference set $\mathcal{R}$ and add them to training data $\mathcal{T}$
5:     Add pairs $P_s$ to the training data $\mathcal{T}$
6:     for k times do
7:         Train the model $f$ (Siamese network) on $\mathcal{T}$
8:         Embed all the names in $\mathcal{R}$: $v_n = f(n)$
9:         for all $n \in \mathcal{R}$ do    ▷ Hard negative mining
10:            find the k closest embeddings $v_{n'}$ to $v_n$
11:            if $n'$ and $n$ do not belong to the same entity then
12:                add $(n, n', 0)$ to the training data $\mathcal{T}$
13:     return the trained embedding function $f$
Algorithm 1 NSEEN: Similarity Learning

3.1.1 Siamese Recurrent Neural Network


We adopt a Siamese Recurrent Neural Network to encode the textual representation of the names in a numerical vector representation. There are three main parts in the architecture and training of this network that are described below:

Siamese Network:

The overall architecture of the neural network we use in this framework is a Siamese network, which consists of two towers with shared weights. In other words, the same tower is copied twice, and a distance function is used at the last layer to denote similarity or dissimilarity of the objects fed to each tower. This architecture has been shown to be effective in learning similarities in several domains, such as text [2] and images [3]. Figure 2 depicts an overview of the network used in our framework.

Figure 2: An overview of the Siamese recurrent neural network used in our framework. We feed the names (i.e., $n_1$ and $n_2$) to the network as sequences of characters and store the values of the last dense feedforward layer of the network as the embeddings (i.e., $v_{n_1}$ and $v_{n_2}$). All the weights are shared between the left and right towers of the network.

We train this Siamese network with pairs of names and a score indicating the similarity of each pair (i.e., $(n_1, n_2, y)$). As shown in Figure 2, $n_1$ and $n_2$ are the texts of the names, represented as sequences of characters, and $y$ indicates the amount of similarity between the names. In the embedding space, we want to minimize the distance between the vector representations of pairs of similar names and maximize the distance between vector representations of dissimilar names.
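As a concrete illustration of this weight sharing, the following PyTorch sketch (our own illustrative code, not the authors' released implementation) applies a single encoder module to both names and compares the resulting embeddings with cosine distance:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNetwork(nn.Module):
    # Both "towers" are the same module, so all weights are shared by
    # construction; a distance layer compares the two embeddings.
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder

    def forward(self, chars_1: torch.Tensor, chars_2: torch.Tensor) -> torch.Tensor:
        v1 = self.encoder(chars_1)  # embedding v_{n_1}
        v2 = self.encoder(chars_2)  # embedding v_{n_2}
        # Cosine distance D = 1 - cos(v1, v2), matching the distance used later.
        return 1.0 - F.cosine_similarity(v1, v2, dim=-1)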

Bidirectional Long Short-Term Memory:

To read the character sequence of the names, we adopt a bidirectional LSTM (Bi-LSTM) variant of recurrent neural networks (RNNs), which has been shown to be successful in text- and language-related tasks. An LSTM cell is shown in Figure 3. Each LSTM cell contains a memory state ($c_t$) and three gates that control the input ($i_t$), the output ($o_t$), and how much should be forgotten at this cell ($f_t$). The input, the previous state, and the memory cell are parametrized by weight matrices ($W$ and $U$) in the LSTM cell.

Figure 3: The Long Short-Term Memory (LSTM) cell used in our model.

For a sequence of characters $(x_1, \dots, x_T)$ fed to the network, at each time step $t$ the gates, memory state, and output of a forward LSTM are computed as shown in equations (1), where $\sigma$ indicates the logistic function ($\sigma(x) = 1/(1 + e^{-x})$), $\tanh$ the hyperbolic tangent, and $\odot$ denotes the Hadamard product:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$    (1)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$h_t = o_t \odot \tanh(c_t)$

Bidirectional LSTMs incorporate both future and past context by running the reverse of the input through a separate LSTM. The output of the combined model at each time step is simply the concatenation of the outputs from the forward and backward networks.
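A character-level Bi-LSTM tower matching this description could be sketched as follows; the layer sizes and the mean-pooling over time steps are our own illustrative choices rather than values from the paper:

import torch
import torch.nn as nn

class CharBiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size: int = 128, char_dim: int = 64,
                 hidden_dim: int = 128, embed_dim: int = 128):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
        # bidirectional=True concatenates the forward and backward outputs
        # at each time step, as described above.
        self.bilstm = nn.LSTM(char_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.dense = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) integer character IDs
        outputs, _ = self.bilstm(self.char_embed(char_ids))
        pooled = outputs.mean(dim=1)  # pool over time steps (one possible choice)
        return self.dense(pooled)     # the stored name embedding v_n

An instance of this encoder can serve as the shared tower in the SiameseNetwork sketch above.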

Contrastive Loss Function:

While several distance functions ($\mathcal{D}$) could be used to compare the learned vectors of the names, we use the cosine distance between the embeddings $v_{n_1}$ and $v_{n_2}$, due to its better performance in higher-dimensional spaces. We then define a contrastive loss [4] based on this distance function to train the model:

$\mathcal{L}(n_1, n_2, y) = \frac{1}{2}\, y\, \mathcal{D}_{v_1,v_2}^{2} + \frac{1}{2}\, (1 - y) \left( \max(0,\; m - \mathcal{D}_{v_1,v_2}) \right)^{2}$    (2)

where $y$ is the similarity label of the pair, $m$ is the margin, and, for brevity of notation, we denote $\mathcal{D}(v_{n_1}, v_{n_2})$ by $\mathcal{D}_{v_1,v_2}$.

The intuition behind the loss function in equation (2) is to pull similar pairs closer to each other, and to push dissimilar pairs apart beyond a radius (or margin) $m$. Note that if dissimilar pairs are more than $m$ units apart, the loss is zero. In our framework we use a margin of 1 in all experiments (i.e., $m = 1$).

The contrastive loss was originally proposed for binary labels, where we either fully pull two points towards each other or fully push them apart. In this paper, we use real-valued labels when we introduce the syntactic variations of the names described in Section 3.1.2, to indicate uncertainty about the similarity of two vectors. Setting the derivative of equation (2) with respect to $\mathcal{D}_{v_1,v_2}$ to zero, for the margin of 1 (i.e., $m = 1$), the distance that minimizes the loss for a real-valued label $y$ is:

$\mathcal{D}^{*}_{v_1,v_2} = 1 - y$    (3)

Therefore, in our setting the optimal distance between the embeddings of two names with 0.3 similarity, for example (i.e., $y = 0.3$), is 0.7. Figure 4 depicts the change in loss as the distance varies for different values of $y$, with the loss-minimizing points marked on each line.
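The real-valued contrastive loss and the minimizer in equation (3) can be checked numerically; the following sketch mirrors equations (2) and (3) with $m = 1$:

import torch

def contrastive_loss(D: torch.Tensor, y: torch.Tensor, m: float = 1.0) -> torch.Tensor:
    # Equation (2): L = 1/2 * y * D^2 + 1/2 * (1 - y) * max(0, m - D)^2.
    pull = 0.5 * y * D.pow(2)                                    # attract similar pairs
    push = 0.5 * (1.0 - y) * torch.clamp(m - D, min=0.0).pow(2)  # repel dissimilar pairs
    return (pull + push).mean()

# Numerical check of equation (3): for y = 0.3 the minimum is at D* = 1 - 0.3 = 0.7.
D_grid = torch.linspace(0.0, 1.0, steps=101)
losses = torch.stack([contrastive_loss(d.view(1), torch.tensor([0.3])) for d in D_grid])
print(D_grid[losses.argmin()].item())  # ~0.7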

Figure 4: Changes in the contrastive loss ($\mathcal{L}$) for varying distance values ($\mathcal{D}_{v_1,v_2}$) under different real-valued labels $y$. The distance that minimizes the loss (i.e., $\mathcal{D}^{*}$) is marked on each line. (Best viewed in color)

3.1.2 Pair Selection and Generation


In order to train the model we need labeled pairs of names (i.e., triples $(n_1, n_2, y)$). We generate three sets of pairs using different approaches: the initial set, based on the names in the reference set; the syntactic variation set, based on domain knowledge; and the hard negative set. The initial and hard negative pairs capture the semantic relationships between names in the reference set, while the augmented syntactic variations capture the syntactic noise that may be present in references to these names in practice. More details about each set of pairs are provided in the following sections.

Initial Semantic Set:

We generate an initial training set of similar and dissimilar pairs based on a random selection of the entities in the reference set $\mathcal{R}$. We generate positive pairs $(n_i^a, n_i^b, 1)$ from the cross product of all names that belong to the same entity, and initialize the negative set of dissimilar pairs $(n_i^a, n_j^c, 0)$ by randomly sampling names that belong to different entities. Formally, the positive pairs are $\{(n_i^a, n_i^b, 1) \mid n_i^a, n_i^b \in N_i\}$, and the negative pairs are sampled from $\{(n_i^a, n_j^c, 0) \mid n_i^a \in N_i,\ n_j^c \in N_j,\ i \neq j\}$.
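A sketch of this initialization follows (illustrative code; the number of random negatives is a free parameter, and unordered positive pairs suffice since the towers share weights):

import itertools
import random

def initial_pairs(reference_set: dict, num_negatives: int) -> list:
    pairs = []
    # Positive pairs: all name pairs of the same entity, labeled 1.
    for names in reference_set.values():
        for n_a, n_b in itertools.combinations(sorted(names), 2):
            pairs.append((n_a, n_b, 1.0))
    # Negative pairs: random names of two different entities, labeled 0.
    ids = list(reference_set)
    for _ in range(num_negatives):
        d_i, d_j = random.sample(ids, 2)  # two distinct entity IDs
        pairs.append((random.choice(sorted(reference_set[d_i])),
                      random.choice(sorted(reference_set[d_j])), 0.0))
    return pairs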

Syntactic Variations and Entity Families:

In order to train the model with the syntactic variations that could be introduced in real-world textual representations of the names, we add pairs of names to the training set and label them with their real-valued string similarities. The argument for using real-valued labels is provided in equation (3), with the intuition that a label of 0 will completely repel two vectors, a label of 1 will bring two vectors as close as possible, and a label between 0 and 1 will keep the two vectors somewhere inside the margin.

We use Trigram-Jaccard, Levenshtein edit distance, and Jaro–Winkler to compute string similarity scores [5] between pairs of names, and include sets of pairs labeled with each similarity score in the training set. The intuition is that the model will learn a combination of all these string similarity measures. To select the name pairs to include in this process, we consider two sets of variations: ones based on the same name, and ones based on different names.

Same-name variations capture the noise that can be introduced into a name in real-world settings. To cover the most common forms of noise affecting the same name, we make the following three modifications, based on our observation of the most frequent variations in the query names, labeling each resulting pair with its computed string similarity $y$:

  • Removing spaces, e.g., (FOX P2, FOXP2, y)

  • Removing all but alphanumeric characters, e.g., (FOX-P2, FOXP2, y)

  • Converting to upper and lower case, e.g., (Ras, RAS, y) and (Ras, ras, y)

Different-name variations introduce a higher level of similarity to the model. We build the second set of pairs by selecting the names of entities that are related and computing their string similarities. For example, in our experiments with proteins we select two entities that belong to the same protein family and generate pairs of names consisting of one name from each. The labels are assigned to these pairs based on their string similarities. This set of pairs not only introduces more diverse variations of textual string similarity, it also captures a higher-level relationship by bringing the embeddings of names that belong to a group closer to each other. Introducing such hierarchical relations into entity representations has been shown to be effective in various domains [6].
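The following sketch illustrates the same-name perturbations and a Trigram-Jaccard label computed from scratch; Levenshtein and Jaro–Winkler labels could be produced analogously (e.g., with a string-similarity package such as jellyfish), which is our own implementation choice for illustration:

def char_trigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}  # fall back for short names

def trigram_jaccard(a: str, b: str) -> float:
    ta, tb = char_trigrams(a), char_trigrams(b)
    return len(ta & tb) / len(ta | tb)

def same_name_variants(name: str) -> set:
    # The three perturbations described above: drop spaces, drop
    # non-alphanumeric characters, and change case.
    return {name.replace(" ", ""),
            "".join(c for c in name if c.isalnum()),
            name.upper(),
            name.lower()} - {name}

def variation_pairs(name: str) -> list:
    # Real-valued labels: the string similarity of each (name, variant) pair.
    return [(name, v, trigram_jaccard(name, v)) for v in same_name_variants(name)]

print(variation_pairs("FOX P2"))
# e.g., ("FOX P2", "FOXP2", ~0.17) and ("FOX P2", "fox p2", 1.0)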

Hard Negative Mining:

Given the large space of possible negative name pairs (i.e., the cross product of the names of different entities), we can only sample a subset for training our model. As stated earlier, we start with an initial random negative sample set in our training data. However, these random samples may often be trivial choices for the model, and after a few epochs they may not contain enough useful signal. Different negative sampling techniques, often called hard negative mining, have been introduced in domains such as knowledge graph construction [7] and computer vision [8] to deal with similar issues.

The idea behind hard negative mining is to find the negative examples that are most informative for the model. These are the examples closest to the decision boundary, to which the model will most likely assign a wrong label. In our setting, as shown in Figure 1 and Algorithm 1, we find the hard negatives by first embedding all the names in the reference set using the latest learned model $f$. We then find the closest names to each name in the embedding space using an approximate k-nearest-neighbors algorithm for fast iterations. The name pairs found by this process that do not belong to the same entity are added to our training set with a 0 label, and the model $f$ is retrained. We repeat this process multiple times to refine the model with several sets of hard negative samples.
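One round of this mining loop can be sketched with the Annoy library (introduced in Section 3.2) for the approximate k-nearest-neighbor look-up; here names, ids, and vectors are assumed to be parallel lists, and k and the tree count are illustrative values:

from annoy import AnnoyIndex

def hard_negative_pairs(names: list, ids: list, vectors: list, k: int = 10) -> list:
    # Index all current embeddings; "angular" approximates cosine distance.
    index = AnnoyIndex(len(vectors[0]), "angular")
    for i, v in enumerate(vectors):
        index.add_item(i, v)
    index.build(50)  # number of random-projection trees
    new_pairs = []
    for i, name in enumerate(names):
        # Nearest neighbors of name i (the first hit is the name itself).
        for j in index.get_nns_by_item(i, k + 1)[1:]:
            if ids[j] != ids[i]:  # different entity: a hard negative
                new_pairs.append((name, names[j], 0.0))
    return new_pairs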

3.2 Reference Set Embedding and Hashing


The model trained in the previous step is essentially a function $f$ that maps a name string to a numerical vector. Since both towers of the Siamese network share all their weights, the final embedding is independent of which tower the string representation is fed to. The goal of our framework is to perform entity normalization, i.e., grounding of query names $n_q$ to the entities in the reference set $\mathcal{R}$. Hence, in this step we embed all the names in the reference set using the final trained model $f$, and store the embeddings for comparison with future queries.

1: procedure Embed($\mathcal{R}$, $f$)
2:     for all $n \in \mathcal{R}$ do
3:         $v_n \leftarrow f(n)$
4:     for all $v_n$ do
5:         Hash and store $v_n$ in $\mathcal{H}$
6:     return the hashed embeddings $\mathcal{H}$
Algorithm 2 NSEEN: Embedding & Hashing

Our final task is to assign an entity in our reference set to the query name by finding the closest entity to it in the embedding space. This assignment is essentially a nearest-neighbor search in the embedding space. The most naive solution would entail a practically infeasible exhaustive pairwise comparison of the query embedding with all embeddings in a potentially large reference set. Moreover, since we iteratively repeat the nearest-neighbor look-up in our training process for hard negative mining, we need a fast way to retrieve the results.

However, this challenge is prevalent in many research and industry applications of machine learning, such as recommender systems, computer vision, and similarity-based search in general, and has resulted in the proposal of several fast approximate nearest-neighbor approaches [9, 10]. We speed up our nearest-neighbor retrieval by transforming and storing our reference set embeddings in an approximate nearest-neighbors data structure. Algorithm 2 describes the overall process of this stage.

To do so, we employ Annoy (Approximate Nearest Neighbors Oh Yeah!) [11], a highly optimized solution used extensively in industry applications such as Spotify for large-scale approximate nearest-neighbor search. Annoy uses a combination of random projections and a tree structure in which intermediate nodes contain random hyperplanes dividing the search space. It supports several distance functions, including Hamming and cosine distance, based on the work of Bachrach et al. [12]. Since we have already transformed the textual representation of an entity name into a numerical vector space, and the entity look-up into a nearest-neighbor search problem, we can always substitute competing approximate nearest-neighbor search methods [13], as well as new state-of-the-art approaches discovered in the future.
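A sketch of Algorithms 2 and 3 on top of Annoy follows; the index parameters are illustrative choices rather than tuned values from the paper:

from annoy import AnnoyIndex

def embed_and_hash(name_vectors: list, num_trees: int = 50) -> AnnoyIndex:
    # Algorithm 2: store every reference-name embedding v_n in the index.
    index = AnnoyIndex(len(name_vectors[0]), "angular")
    for i, v in enumerate(name_vectors):
        index.add_item(i, v)
    index.build(num_trees)
    return index

def retrieve(query_vector: list, index: AnnoyIndex, ids: list, k: int = 10) -> list:
    # Algorithm 3: approximate nearest-neighbor search for the query
    # embedding v_q; return the entity IDs of the closest reference names.
    return [ids[i] for i in index.get_nns_by_vector(query_vector, k)]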

3.3 Retrieval


During the retrieval step, depicted in Algorithm 3, we first compute an embedding for the query name using the same model $f$ that we used to embed the reference set. We then perform an approximate nearest-neighbor search in the embedding space for the query name, and return the ID of the retrieved neighbor as the most probable entity ID for the query name. Note that in our setup we do not need a separate direct look-up for query names that exactly match one of the canonical names in the reference set: if the query name is one of the canonical names, it will have exactly the same embedding as, and zero distance to, one of the reference set names.

1: procedure Retrieve($n_q$, $\mathcal{H}$, $f$)
2:     Embed the query name: $v_q \leftarrow f(n_q)$
3:     Find the closest $v_n$ to $v_q$ using approximate nearest-neighbor search (Annoy) on $\mathcal{H}$
4:     return the ID associated with $v_n$ (i.e., $d_q$)
Algorithm 3 NSEEN: Retrieval

4 Experimental Validation

We conduct two sets of experiments, mapping query names to their canonical entities, to empirically validate the effectiveness of our framework (we will release our code with the final version of the paper). The two reference sets are UniProt, containing proteins, and ChEBI, containing chemical entities; the query set is drawn from PubMed extracts provided by the BioCreative initiative. All are detailed in the following sections.

4.1 Reference Sets


Both of the reference sets used in our experiments are publicly available on the internet and are the authorities on canonical entity representations in their respective domains.

UniProt.

The Universal Protein Resource (UniProt) is a large database of protein sequences and associated annotations [14]. For our experiments, we use the different names associated with each human protein in the UniProt dataset and their corresponding IDs. Hence, the task here is mapping a human protein name to a canonical UniProt ID.

ChEBI.

We use the chemical entity names indexed in the Chemical Entities of Biological Interest (ChEBI) dataset. ChEBI is a dataset of molecular entities focused on 'small' chemical compounds, including any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, or conformer identifiable as a separately distinguishable entity [15]. The task here is mapping a small-molecule name to a canonical ChEBI ID.

Table 1 lists the total number of entities (i.e., unique IDs $d_i$) and their corresponding ID–name pairs in the reference sets, showing that UniProt has fewer entities but more names per entity than ChEBI. Moreover, Figure 7 shows the distribution of the number of names per entity in the reference sets. Note that there are no entities in the UniProt reference set with only one name, while there are many proteins with several names. In contrast, the ChEBI dataset contains many entities with only one name.

Dataset            Entity count    ID–name pair count
UniProt (Human)    20,375          123,590
ChEBI              72,241          277,210
Table 1: Statistics of the entities in the reference sets
Figure 7 (a: UniProt, b: ChEBI): Distribution of the number of names per entity in the reference datasets. The proteins in the UniProt reference set are referred to by many different names compared to the chemical entities in the ChEBI dataset.

4.2 Query Set


We use the dataset provided by the BioCreative VI Interactive Bio-ID Assignment Track [16] as our query data. This dataset provides several types of biomedical entity annotations created by SourceData curators that map published article texts to their corresponding database IDs. The most interesting aspect of the BioCreative corpus for entity normalization is that the extracted entity names come from real-world published articles, and contain the entity-name variations and deviations that are present in reality.

The Bio-ID dataset includes both a training and a test set. We use both as query sets with gold-standard labels to evaluate our method. The training set (which we call BC1) consists of 13,573 annotated figure-panel captions corresponding to 3,658 figures from 570 full-length articles from 22 journals, for a total of 102,717 annotations. The test set (which we call BC2) consists of 4,310 annotated figure-panel captions from 1,154 figures taken from 196 full-length journal articles, with 30,286 annotations in total [16].

Table 2 shows the number of UniProt and ChEBI entities in the annotated corpus. In our experiments we keep the original training (BC1) and test (BC2) splits of the data for reproducibility and ease of future comparison, but we note that for our purposes both BC1 and BC2 are simply sources of correct normalizations; our algorithm is not trained or tested on these datasets in the usual sense.

Dataset    UniProt (Total / Unique)    ChEBI (Total / Unique)
BC1        30,211 / 2,833              9,869 / 786
BC2        1,592 / 1,321               829 / 543
Table 2: Statistics of the annotations in the BioCreative VI Bio-ID corpus

4.3 Baseline


We use the current production system for named entity grounding at the USC Information Sciences Institute, developed for the DARPA Big Mechanism program, as the baseline. The system is an optimized solution that employs a tuned combination of several string similarities, including Jaccard, Levenshtein, and Jaro–Winkler distances, with a prefix-based blocking system. It also includes post-hoc re-ranking of the results based on domain knowledge, such as the authenticity of the entity (e.g., whether the protein entry in UniProt has been reviewed by a human), the match between ID components and the query name, and the popularity of the entities in each domain. This system provides entity grounding for several biomedical entity types, including proteins and chemicals, and is publicly available [17]. The system can produce results based on FRIL [18] and Apache Lucene [19], which are widely used in real-world production environments; we use the overall best results of both settings as the baseline for our experiments.

4.4 Results


Table 3 shows the comparative results of our method (NSEEN) and the baseline. We submit every query name in the BioCreative datasets to both systems and retrieve the top k most probable IDs from each. We then check whether the correct ID (provided as the label in the BioCreative dataset) is present in the top k retrieved results (i.e., Hits@k) for several values of k. We outperform the baseline in almost all settings. Chemical names are more sensitive to parentheses, commas, and dashes, and are harder for the baseline to normalize, while our method produces significantly better results.

Reference set    Query set    Model       H@1      H@3      H@5      H@10
UniProt          BC1          NSEEN       0.833    0.869    0.886    0.894
UniProt          BC1          Baseline    0.814    0.864    0.875    0.885
UniProt          BC2          NSEEN       0.861    0.888    0.904    0.930
UniProt          BC2          Baseline    0.841    0.888    0.904    0.919
ChEBI            BC1          NSEEN       0.505    0.537    0.554    0.574
ChEBI            BC1          Baseline    0.418    0.451    0.460    0.468
ChEBI            BC2          NSEEN       0.578    0.608    0.624    0.641
ChEBI            BC2          Baseline    0.444    0.472    0.480    0.491
Table 3: Hits@k on the BioCreative training (BC1) and test (BC2) query sets mapped to the UniProt and ChEBI reference sets.
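For reference, Hits@k as used above can be computed with a short sketch, where each query contributes its gold ID and a ranked list of retrieved IDs:

def hits_at_k(retrieved: list, gold: list, ks=(1, 3, 5, 10)) -> dict:
    # retrieved: one ranked list of IDs per query; gold: the correct ID per query.
    n = len(gold)
    return {k: sum(g in r[:k] for r, g in zip(retrieved, gold)) / n for k in ks}

print(hits_at_k([["ID_1", "ID_2"], ["ID_9", "ID_3"]], ["ID_1", "ID_3"]))
# {1: 0.5, 3: 1.0, 5: 1.0, 10: 1.0}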

Furthermore, Table 4 and the corresponding Figure 12 show example protein-name queries mapped to the UniProt reference set and the retrieved canonical names. Note that none of the query names exists in the UniProt reference set in the form provided as the query. Table 4 shows that our method not only captures syntactic variations in the top-10 responses, but includes semantically equivalent names as well. These responses can have a significantly large string distance to the query name, e.g., (S6K, 52 kDa ribosomal protein S6 kinase), (PLC2, Phospholipase C-gamma-2), (IKK, I-kappa-B kinase epsilon), and (H3, Histone H3/a).

Figure 12 sheds more light on the embedding space, highlighting the same four query names and the names corresponding to the correct entities in the UniProt reference set. As shown in the figure, most of the correct responses (in blue) are clustered around the query name (in red).

Query: S6K
  p70-S6K 1*
  p90-RSK 6
  S6K1*
  p70 S6KA*
  S6K-beta
  p70 S6KB
  90 kDa ribosomal protein S6 kinase 6
  90 kDa ribosomal protein S6 kinase 5
  52 kDa ribosomal protein S6 kinase*
  RPS6KA6

Query: PLC2
  PLC-gamma-2*
  PLC-gamma-1
  PLCG2*
  Phospholipase C-gamma-2*
  Phospholipase C-gamma-1
  PLC
  PLCG1
  Phosphoinositide phospholipase C-gamma-2*
  PLC-IV*
  PLCB

Query: IKK
  IKK-epsilon*
  IKKE*
  I-kappa-B kinase epsilon*
  IkBKE*
  IKBKE*
  IKBE
  IK1
  IK1
  IKKG
  INKA1

Query: H3
  Histone H3/a*
  Histone H3/o*
  Histone H3/m*
  Histone H3/b*
  Histone H3/f*
  HIST1H3C*
  Histone H3/k*
  Histone H3/i*
  HIST1H3G*
  Histone H3/d*

Table 4: Example queries and their corresponding top-10 responses from our framework (NSEEN) on the UniProt reference set. The names of the correct entities are marked with an asterisk. Note that none of the queries has an exact string match in the reference set, and correct semantic names that are not close string matches are present in the top-10 responses in all four examples.
Figure 12 (a: S6K, b: PLC2, c: IKK, d: H3): tSNE representation of the example UniProt query entities shown in Table 4 and the corresponding entities in the reference set. The queries are shown with red triangles and the correct entities of the reference set with blue triangles. A sample of one thousand names from the reference set is shown with light grey dots to represent the embedding space. The bottom-right insets show a zoomed version of the correct names clustered around the query name. (Best viewed in color)

5 Related Work

Linking entities to their canonical forms is one of the most fundamental tasks in information retrieval and automatic knowledge extraction [20]. Depending on the type of data and setting, this general task may be referred to as record linkage [21], de-duplication [22], or entity resolution [23]. The important aspect of our setting is the presence of a canonical reference set, which makes the question "which one of these canonical entities is referred to?", in contrast to "are these two records the same?" in settings where the canonical entity is latent. Such settings are especially important in biomedical domains [24].

String similarity [25] is at the core of most entity resolution methods. While several string similarity metrics are traditionally used in this domain, our approach learns a similarity metric based on both syntactic and semantic information. We use a deep Siamese neural network, which has been shown to be effective in learning similarities in text [2] and images [3]; both of these approaches define a contrastive loss function [4] to learn similarities.

To avoid exhaustive pairwise computation of similarities between entities, blocking [26] or indexing [27] techniques are often used to reduce the search space. These methods are often based on approximate string matching. The most effective methods in this area hash the string with the primary purpose of blocking the entities; in contrast, in our method the embedding of names into a numerical space is aimed at capturing similarities between entities, and the hashing step mainly approximates and preserves the original embedding space.

Recently, Ebraheem et al. [28] proposed a deep neural network solution for de-duplication of tuples (with multiple columns) in a database. Our setting differs from theirs in that we operate on entity name strings and have a canonical reference set to match the names against.

6 Discussion

In this paper, we proposed a general deep neural network based framework for entity normalization. We showed how to encode the semantic information hidden in a reference set, and how to incorporate potential syntactic variations in the numeric embedding space via training-pair generation. In this process we showed how the contrastive loss can be used with non-binary labels to capture uncertainty. We further introduced a dynamic hard negative sampling method to refine the embeddings. Finally, by transforming the traditional task of entity normalization into a standard k-nearest-neighbors problem in a numerical space, we showed how to employ a scalable representation for fast retrieval that is applicable in real-world scenarios without the need for traditional entity blocking methods.

In our preliminary analysis, we experimented with different selection methods in the k-nearest-neighbors retrieval process, such as a top-k majority-vote scheme, but did not find them significantly effective in our setting. We also experimented with different soft-labeling methods to dynamically re-rank the results, such as soft re-labeling the k nearest neighbors, but did not see much improvement in overall performance.

While already highly effective, our method could benefit from improvements to some of its components in future research. Currently, each set of generated training name pairs in our system enforces a type of syntactic or semantic similarity. The number of pairs in each set is a hyper-parameter of our framework that can bias the final embeddings towards different types of similarity. More principled approaches to tuning these numbers, beyond empirical search, could further improve the performance of our model.

Acknowledgments

The authors would like to thank Joel Mathew for implementing the classical record linkage baseline and providing the baseline results. This work is partially supported by the DARPA Big Mechanism program under contract number W911NF-14-1-0364.

References

  • Papadakis et al. [2016] George Papadakis, Jonathan Svirsky, Avigdor Gal, and Themis Palpanas. Comparative analysis of approximate blocking techniques for entity resolution. Proceedings of the VLDB Endowment, 9(9):684–695, 2016.
  • Neculoiu et al. [2016] Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. Learning text similarity with siamese recurrent networks. In Proceedings The 1st Workshop on Representation Learning for NLP, 2016.
  • Taigman et al. [2014] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
  • Hadsell et al. [2006] R Hadsell, S Chopra, and Y LeCun. Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, pages 1735–1742, 2006.
  • Cohen et al. [2003] William Cohen, Pradeep Ravikumar, and Stephen Fienberg. A comparison of string metrics for matching names and records. In KDD workshop on data cleaning and object consolidation, pages 73–78, 2003.
  • Chen et al. [2018] Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. HARP: Hierarchical representation learning for networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • Kotnis and Nastase [2017] Bhushan Kotnis and Vivi Nastase. Analysis of the impact of negative sampling on link prediction in knowledge graphs. In WSDM 1st Workshop on Knowledge Base Construction, Reasoning and Mining (KBCOM), 2017.
  • Shrivastava et al. [2016] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
  • Ponomarenko et al. [2014] Alexander Ponomarenko, Nikita Avrelin, Bilegsaikhan Naidan, and Leonid Boytsov. Comparative analysis of data structures for approximate nearest neighbor search. Data Analytics, pages 125–130, 2014.
  • Rastegari et al. [2013] Mohammad Rastegari, Jonghyun Choi, Shobeir Fakhraei, Daume Hal, and Larry Davis. Predictable dual-view hashing. In International Conference on Machine Learning, pages 1328–1336, 2013.
  • [11] Erik Bernhardsson et al. Annoy (Approximate Nearest Neighbors Oh Yeah). https://github.com/spotify/annoy.
  • Bachrach et al. [2014] Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein, Nir Nice, and Ulrich Paquet. Speeding up the xbox recommender system using a euclidean transformation for inner-product spaces. In Proceedings of the 8th ACM Conference on Recommender systems, pages 257–264, 2014.
  • Naidan and Boytsov [2015] Bilegsaikhan Naidan and Leonid Boytsov. Non-metric space library manual. arXiv preprint arXiv:1508.05470, 2015.
  • Apweiler et al. [2004] Rolf Apweiler, Amos Bairoch, Cathy H Wu, Winona C Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Research, 32:D115–D119, 2004.
  • Hastings et al. [2015] Janna Hastings, Gareth Owen, Adriano Dekker, Marcus Ennis, Namrata Kale, Venkatesh Muthukrishnan, Steve Turner, Neil Swainston, Pedro Mendes, and Christoph Steinbeck. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research, 44(D1):D1214–D1219, 2015.
  • Arighi et al. [2017] Cecilia Arighi, Lynette Hirschman, Thomas Lemberger, Samuel Bayer, Robin Liechti, Donald Comeau, and Cathy Wu. Bio-id track overview. In Proceedings of the BioCreative VI Workshop, 2017.
  • [17] University of Southern California - Information Science Institute Entity Grounding System. http://dna.isi.edu:7100/.
  • Jurczyk et al. [2008] Pawel Jurczyk, James J Lu, Li Xiong, Janet D Cragan, and Adolfo Correa. FRIL: a tool for comparative record linkage. In American Medical Informatics Association (AMIA) Annual Symposium Proceedings, 2008.
  • Białecki et al. [2012] Andrzej Białecki, Robert Muir, and Grant Ingersoll. Apache lucene 4. In SIGIR 2012 workshop on open source information retrieval, page 17, 2012.
  • Christen [2012a] Peter Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, 2012a.
  • Koudas et al. [2006] Nick Koudas, Sunita Sarawagi, and Divesh Srivastava. Record linkage: similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 802–803, 2006.
  • Elmagarmid et al. [2007] Ahmed K Elmagarmid, Panagiotis G Ipeirotis, and Vassilios S Verykios. Duplicate record detection: A survey. IEEE Transactions on knowledge and data engineering, 19(1):1–16, 2007.
  • Getoor and Machanavajjhala [2012] Lise Getoor and Ashwin Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12):2018–2019, 2012.
  • Leaman and Lu [2016] Robert Leaman and Zhiyong Lu. TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics, 32(18):2839–2846, 2016.
  • Cheatham and Hitzler [2013] Michelle Cheatham and Pascal Hitzler. String similarity metrics for ontology alignment. In International Semantic Web Conference, pages 294–309, 2013.
  • Michelson and Knoblock [2006] Matthew Michelson and Craig A Knoblock. Learning blocking schemes for record linkage. In AAAI, pages 440–445, 2006.
  • Christen [2012b] Peter Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE transactions on knowledge and data engineering, 24(9):1537–1555, 2012b.
  • Ebraheem et al. [2018] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. Distributed representations of tuples for entity resolution. Proceedings of the VLDB Endowment, 11(11), 2018.