Mining Entity Synonyms with Efficient Neural Set Generation

11/16/2018 ∙ by Jiaming Shen, et al. ∙ University of Southern California Shanghai Jiao Tong University University of Illinois at Urbana-Champaign 0

Mining entity synonym sets (i.e., sets of terms referring to the same entity) is an important task for many entity-leveraging applications. Previous work either rank terms based on their similarity to a given query term, or treats the problem as a two-phase task (i.e., detecting synonymy pairs, followed by organizing these pairs into synonym sets). However, these approaches fail to model the holistic semantics of a set and suffer from the error propagation issue. Here we propose a new framework, named SynSetMine, that efficiently generates entity synonym sets from a given vocabulary, using example sets from external knowledge bases as distant supervision. SynSetMine consists of two novel modules: (1) a set-instance classifier that jointly learns how to represent a permutation invariant synonym set and whether to include a new instance (i.e., a term) into the set, and (2) a set generation algorithm that enumerates the vocabulary only once and applies the learned set-instance classifier to detect all entity synonym sets in it. Experiments on three real datasets from different domains demonstrate both effectiveness and efficiency of SynSetMine for mining entity synonym sets.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

Introduction

Figure 1: SynSetMine Framework Overview.

An entity synonym set is a set of terms (i.e., words or phrases) that refer to the same real-world entity. For instance, {“USA”, “United States”, “U.S.”} is an entity synonym set as all terms in it refer to the same country. Entity synonym set discovery can benefit a wide range of applications such as web search [Cheng, Lauw, and Paparizos2012], question answering [Zhou et al.2013], and taxonomy construction [Anh, Kim, and Ng2015]. Take the query “Emirates Airline U.S. to UAE” as an example, understanding “U.S.” refers to “United States” and “UAE” stands for “United Arab Emirates” is crucial for an intelligent system to satisfy the user information need.

One line of work for entity synonym sets discovery takes a ranking plus pruning

approach. Given a query term referring to one entity, it first ranks all candidate terms based on their probabilities of referring to the same entity and then prunes the rank list into an output set. By treating each term in a vocabulary as a query, this approach can finally output all the entity synonym sets in the vocabulary. A variety of features are extracted, including corpus-level statistics

[Turney2001], textual patterns [Nakashole, Weikum, and Suchanek2012], or query contexts [Chakrabarti et al.2012], from different data sources (e.g., query log [Chaudhuri, Ganti, and Xin2009], web table [He et al.2016], and raw text corpus [Qu, Ren, and Han2017]) to calculate the above probabilities. However, this approach treats each candidate term separately and computes its probability of referring to the query entity independently. As a result, it ignores relations among candidate terms which could have helped to improve the quality of discovered synonym sets. Furthermore, as shown in [Ren and Cheng2015], the number of true synonyms varies a lot across different entities and thus converting a rank list into a set itself is a non-trivial problem and can be error-prone.

Another line of work divides the synonym set discovery problem into two sequential subtasks: (1) synonymy detection (i.e., finding term pairs of synonymy relation), and (2) synonymy organization (i.e., aggregating synonymous term pairs into synonym sets). Methods developed for synonymy detection leverage textual patterns [Wang and Hirst2012] and distributional word representations [Shwartz and Dagan2016] to train a classifier that predicts whether two terms hold the synonymy relation. Then, those predicted term pairs form a synonymy graph on which different graph clustering algorithms are applied [Hope and Keller2013, Oliveira and Gomes2014, Ustalov, Panchenko, and Biemann2017]. This approach is able to capture relations among candidate terms and returns all entity synonym sets in the vocabulary. However, these two-phase methods only use training data in their first phase and cannot leverage training signals in the second phase. Furthermore, the detected synonymous term pairs are usually fixed during synonymy organization and there is no feedback from the second phase to the first, which causes the error accumulation problem.

In this work, we propose a new framework, SynSetMine, which leverages existing synonym sets from a knowledge base as distant supervision and extracts more synonym sets not in knowledge bases from massive raw text corpus. Specifically, SynSetMine first applies an entity linker to map in-corpus text (i.e. entity mentions) to entities in the knowledge base. Then, it groups all mentions mapped to the same entity (with the same unique id) into an entity synonym set, which provides supervision signals. As these “training” synonym sets are created automatically and without any human effort, we refer to them as distant supervision.

To effectively leverage distant supervision signals, SynSetMine consists of two novel modules. First, we train a set-instance classifier which jointly learns how to represent an entity synonym set and whether to include a new instance (i.e., a term) into the set. This set-instance classifier can model a set holistically, instead of decomposing it into separated pairs. As a result, it effectively captures relations among candidate terms and directly leverages supervision signals from the set structure. Second, we design an efficient set generation algorithm that applies the learned set-instance classifier to discover new entity synonym sets. Given a vocabulary, this algorithm processes each term in it one at a time and maintains a pool of all detected sets. For each term in the vocabulary, the algorithm applies the set-instance classifier to determine whether and which previous detected set this term should reside in. If no matching set can be found, a new set is formed, consisting of this single term, and added into the pool of detected sets. As it only enumerates the vocabulary once to generate all synonym sets, this algorithm is efficient.

Contributions. This study makes three contributions: (1) a novel framework, SynSetMine, is proposed that leverages distant supervision for entity synonym set mining; (2) a set-instance classifier is designed to model entity synonym sets holistically and is integrated into a set generation algorithm to discover new synonym sets efficiently; and (3) extensive experiments conducted on three real-world datasets from different domains show the effectiveness of the method.

Problem Formulation

We first elaborate on some important concepts and then formulate the problem.

Entity Synonym Set. An entity synonym set is a set of terms (i.e., words or phrases) that refer to the same real-world entity.

Knowledge Base. A knowledge base consists of many facts about a set of entities. In this work, we focus on one particular type of facts: entity synonym. For some entities, their synonyms are manually curated and stored in a knowledge base. The knowledge base provides such training signals to help discover more entity synonym sets.

Problem Definition. Given a text corpus , a vocabulary (i.e., a list of terms) derived from , and a knowledge base , the task of mining entity synonym set aims to discover all entity synonym sets consisting of terms in , based on the information extracted from and .

The SynSetMine Framework

Our proposed SynSetMine framework consists of three main steps (Figure 1): (1) an entity linker is used to map in-corpus text (i.e., entity mentions) to entities in the given knowledge base, which provides some training entity synonym sets as distant supervision; (2) a classifier is constructed to determine whether a term should be included into the synonym set, based on the above distant supervision; and (3) the learned classifier is integrated into a set generation algorithm which outputs all term clusters in the vocabulary as detected entity synonym sets.

Distant Supervision Acquisition

A knowledge base contains a collection of curated entities with their known synonyms. These entities can provide distant supervision signals to help discover more entity synonym sets that are not in the knowledge base from raw text corpus. To automatically acquire the distant supervision, we first apply an existing entity linker such as DBpedia Spotlight [Mendes et al.2011] which directly maps in-corpus text (i.e., entity mentions) to entities in the knowledge base. However, most entity linkers are not perfect and heavily rely on the string-level features which could introduce additional noise, as shown in [Shen et al.2018] and the example below. Therefore, we follow the same procedure in [Qu, Ren, and Han2017] to reduce the linking errors. Specifically, for each entity mention and its linked entity, if the mention surface string is not in that entity’s synonym set (in the knowledge base), we remove the link between them. Finally, we group all entity mentions that linked to the same entity as a training entity synonym set, and collect all synonym sets from the linked corpus as distant supervision.

Example 1.

Given a sentence “The U.S. Steel, located in Pennsylvania, USA, is a leading steel producer in America”, an entity linker may first map “U.S. Steel”, “USA” and “America” to the entity “United States”. Then, we find that the synonym set of entity “United States”, retrieved from knowledge base, does not contain the surface string “U.S. Steel”. Therefore, we remove the link between them and group the remaining two entity mentions into an entity synonym set {“USA”, “America”}.

Learning Set-Instance Classifier

After obtaining distant supervision, we train a set-instance classifier, denoted as , to determine whether a synonym set should include an instance (i.e., a term).

Set-Instance Classifier Architecture.

One key requirement of the set-instance classifier is that its output should be invariant to the ordering of elements in set . An intuitive way to achieve such permutation invariance property is to decompose the set-instance prediction into multiple instance-instance pair predictions, as shown in Figure 2. Each pair prediction decides whether two instances are synonyms, and all pair prediction results are finally aggregated into the set-instance prediction result. However, this method completely ignores the relations between elements in set .

In this work, we are inspired by [Zaheer et al.2017]

and design a neural network architecture that directly learns to represent the permutation invariant set. The architecture of our set-instance classifier is shown in Figure

3. The bottom part of Figure 3 shows a set scorer which takes a set as input, and returns a quality score that measures how complete and coherent this set is. Given a synonym set and an instance term , our set-instance classifier first applies the set scorer to obtain the quality score of input set (i.e., ). Then, we add the instance into the set and apply the set scorer again to obtain the quality score of . Finally, we calculate the difference between these two quality scores, and transform this score difference into the probability using a sigmoid unit as follows:

(1)

where

is the sigmoid function.

Given a collection of set-instance pairs with their corresponding labels , we learn the set-instance classifier using the log-loss as follows:

(2)

where equals to 1 if and equals to 0 otherwise.

Note that if the set scorer is invariant to the ordering of elements in its input set, our set-instance classifier built on it will naturally satisfy the permutation invariance property. Following we first describe the set scorer architecture (the bottom part of Figure 3) and then discuss how to generate set-instance pairs from distant supervision.

Figure 2: Aggregating instance-instance pair prediction results into set-instance prediction result.

Figure 3: Architecture of our set-instance classifier.

Set Scorer Architecture.

Given a set of terms , the set scorer first passes each term into an embedding layer and obtains its embedding . Then, an “embedding transformer” is applied to transform the raw embedding to a new term representation . After that, we sum all transformed term representations and obtain the “raw” set representation . Finally, we feed this set representation into the “post-transformer” which outputs the final set quality score. Since the summation operation is commutative, the “raw” set representation is invariant to the ordering of elements in and so is the set-scorer .

In this work, we initialize the embedding layer using term embeddings pre-trained on the given corpus. We instantiate the “embedding transformer” using a fully connected neural network with two hidden layers. For the “post transformer”, we construct it using another fully connected neural network with three hidden layers. We demonstrate the necessity of these two transformers and study how the size of hidden layers may influence model performance in later experiments.

Generation of Training Set-Instance Pairs.

To train the set-instance classifier , we need to first generate a collection of training set-instance pairs from training synonym sets. For each entity synonym set , we randomly holdout one instance and construct a positive set-instance sample where . Then, for each positive sample , we generate negative samples by selecting negative instances and pair each of them with . To generate each negative instance , we can either randomly choose a term from the vocabulary (denoted as complete-random strategy); select a term that shares same token with some string in (denoted as share-token strategy), or combine these two methods (denoted as mixture strategy). We study how this negative sample size and the sampling strategy influence the model performance in later experiments.

Set Generation Algorithm

We present our designed set generation algorithm for mining entity synonym set in Algorithm 1. This algorithm takes the above learned set-instance classifier, a vocabulary , and a probability threshold as input, and clusters all terms in the vocabulary into entity synonym sets. Specifically, this algorithm enumerates the vocabulary once and maintains a pool of all detected sets . For each term , it applies the set-instance classifier to calculate the probability of adding this term into each detected set in , and finds the best set that has the largest probability, If this probability value passes the threshold , we will add into set . Otherwise, we create a new set with this single term and add it into the set pool . The entire algorithm stops after one pass of the vocabulary and returns all detected sets in . Note that we do not need to specify the number of entity synonym sets in the vocabulary and our set generation algorithm will determine this value on its own. In this work, we simply set the probability threshold to be 0.5 and study its influence on clustering performance below.

Input: A set-instance classifier ; An input vocabulary ; A threshold .
Output: entity synonym sets where , , .
1 ; // initialize the first single-element cluster;
2 for  from 2 to  do
3        = 0;
4        = 1;
5        for  from 1 to  do
6               if  then
7                      ;
8                      ;
9                     
10       if  then
11               ;
12              
13       else
14               .append({}); //add a new cluster into the output;
15              
16Return ;
Algorithm 1 Set Generation Algorithm

Complexity Analysis.

To compare with the clustering algorithm that requires the input cluster number, we suppose our algorithm will eventually return clusters. Then, our algorithm applies the set-instance classifier at most times, where denotes the vocabulary size. Most computation efforts of set-instance classifier are the matrix multiplication in two transformers, which can be accelerated by GPU. As a comparison, the time complexity of -means algorithm is where denotes the number of iterations, and for most supervised clustering methods (such as [Finley and Joachims2005]) and two-phase methods (such as WATSET [Ustalov, Panchenko, and Biemann2017]), their time complexity is as they need to explicitly calculate all pairwise similarities. Finally, we emphasize that we can further parallelize lines 5-8 in Algorithm 1 by grouping all set-instance pairs into a batch and applying the set-instance classifier only once. In practice, this strategy can significantly reduce the running time.

Experiments

In this section, we first describe our experimental setup, and then report the experimental results. Finally, we analyze each component of SynSetMine in more details and show several concrete case studies. Our model implementation is available at: https://github.com/mickeystroller/SynSetMine-pytorch.

Experimental Setup

Datasets. We evaluate SynSetMine on three public benchmark datasets used in [Qu, Ren, and Han2017]:

  1. Wiki contains 100K articles in the Wikipedia. We use Freebase111https://developers.google.com/freebase/ as the knowledge base.

  2. NYT includes about 119K news articles from 2013 New York Times. We use Freebase as the knowledge base.

  3. PubMed contains around 1.5M paper abstracts from PubMed222https://www.ncbi.nlm.nih.gov/pubmed. We select UMLS333https://www.nlm.nih.gov/research/umls/ as the knowledge base.

For Wiki and NYT datasets, DBpedia Spotlight444https://github.com/dbpedia-spotlight/dbpedia-spotlight is used as the entity linker, and for PubMed, we apply PubTator555https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/ as the entity linker. After the linking step, we randomly sample a portion of linked entities as test entities and treat the remaining as training entities. Note that there is no overlapping between training vocabulary and testing vocabulary, which makes the evaluation more realistic but also more challenging. The statistics of these datasets are listed in Table 1. All datasets are available at: http://bit.ly/SynSetMine-dataset.

Dataset Wiki NYT PubMed
#Documents 100,000 118,664 1,554,433
#Sentences 6,839,331 3,002,123 15,051,203
#Terms in train 8,731 2,600 72,627
#Synonym sets in train 4,359 1,273 28,600
#Terms in test 891 389 1,743
#Synonym sets in test 256 117 250
Table 1: Datasets Statistics.
Method Wiki NYT PubMed
ARI (std) FMI (std) NMI (std) ARI (std) FMI (std) NMI (std) ARI (std) FMI (std) NMI (std)
Kmeans 34.35 (1.06) 35.47 (0.96) 86.98 (0.27) 28.87 (1.98) 30.85 (1.76) 83.71 (0.57) 48.68 (1.93) 49.86 (1.79) 88.08 (0.45)
Louvain 42.25 (0.00) 46.48 (0.00) 92.58 (0.00) 21.83 (0.00) 30.58 (0.00) 90.13 (0.00) 46.58 (0.00) 52.76 (0.00) 90.46 (0.00)
SetExpan+Louvain 44.78 (0.28) 44.95 (0.28) 92.12 (0.02) 43.92 (0.90) 44.31 (0.93) 90.34 (0.11) 58.91 (0.08) 61.87(0.07) 92.23 (0.15)
COP-Kmeans 38.80 (0.51) 39.96 (0.49) 90.31 (0.15) 33.80 (1.94) 34.57 (2.06) 87.92 (0.30) 49.12 (0.85) 51.92 (0.83) 89.91 (0.15)
SVM+Louvain 6.03 (0.73) 7.75 (0.81) 25.43 (0.13) 3.64 (0.42) 5.10 (0.39) 21.02 (0.27) 7.76 (0.96) 8.79 (1.03) 31.08 (0.34)
L2C 12.87 (0.22) 19.90 (0.24) 73.47 (0.29) 12.71 (0.89) 16.66 (0.68) 70.23 (1.20)
SynSetMine 56.43 (1.31) 57.10 (1.17) 93.04 (0.23) 44.91 (2.16) 46.37 (1.92) 90.62 (1.53) 74.33 (0.66) 74.45 (0.64) 94.90 (0.97)
Table 2:

Quantitative results of entity synonym set mining. All metrics are in percentage scale. We run all methods except L2C five times and report the averaged score with standard deviation. Due to the bad scalability of L2C, we have not obtain its results on PubMed dataset within 120h, and indicate this using “–” mark.

Compared Methods. We select the following algorithms to compare with our method.

  1. Kmeans: An unsupervised clustering algorithm which takes term embedding as features and returns detected synonym sets as clusters. This algorithm requires a predefined cluster number and we set its value to the oracle number of clusters for each dataset.

  2. Louvain [Blondel et al.2008]666https://github.com/taynaud/python-louvain

    : An unsupervised community detection algorithm which takes a graph as input and returns discovered graph communities. To apply this algorithm, we first construct a term graph where each node represents a term. Then, we calculate the cosine similarity between each pair of term embeddings, and if the similarity is larger than a threshold

    , we will add an edge into the graph. We tune this threshold on training set.

  3. SetExpan [Shen et al.2017]777https://github.com/mickeystroller/SetExpan+Louvain: A two-phase unsupervised approach that first uses SetExpan (i.e., a weakly-supervised set expansion algorithm) to find each term’s nearest neighbors in embedding space, and then construct a -NN graph on which the above Louvain algorithm is applied. We tune the variable on training set.

  4. COP-Kmeans [Wagstaff et al.2001]888https://github.com/Behrouz-Babaki/COP-Kmeans: A semi-supervised variation of the Kmeans algorithm that can incorporate pairwise constraints (e.g., two elements must or cannot be clustered together) and output clusters that satisfy all constraints. We convert training synonym sets into these pairwise constraints and set the oracle number of clusters for each dataset.

  5. SVM999http://scikit-learn.org/stable/modules/svm.html+Louvain: A two-phase supervised approach which first uses a SVM for synonym pair prediction and then groups all predicted pairs into a graph where the Louvain algorithm is applied. The SVM is learned on training set.

  6. L2C [Hsu, Lv, and Kira2018]101010https://github.com/GT-RIPL/L2C: A supervised clustering method that learns a pairwise similarity prediction neural network and a constrained clustering network on training synonym sets, then applies the learned networks on test vocabulary to detect new entity synonym sets.

  7. SynSetMine: Our proposed approach which trains a set-instance classifier and integrates it seamlessly into an efficient set generation algorithm.

Parameter Settings and Initial Term Embedding. For a fair comparison, we use the same 50-dimension term embedding, trained on each corpus, across all compared methods. We tune hyper-parameters in all (semi-)supervised algorithms using 5-fold cross validation on training set. For SynSetMine, we use a neural network with two hidden layers (of sizes 50, 250) as embedding transformer, and another neural network with three hidden layers (of sizes 250, 500, 250) as post transformer (c.f. Figure 3). We optimize our model using Adam with initial learning rate 0.001 and apply dropout technique with dropout rate 0.5. For the set generation algorithm, we set the probability threshold be 0.5. We will discuss the influence of these hyper-parameters later.

Evaluation Metrics.

As all compared methods output entity synonym sets in the form of clusters, we evaluate them using three standard clustering evaluation metrics.

  • ARI measures the similarity of two cluster assignments. Given the ground truth cluster assignment and model predicted cluster assignment , we use () to denote the number of element pairs that are in the same (different) cluster(s) in both and , respectively. We denote the total number of element pairs in as , and then calculate ARI as follows:

    where is the expected RI of random assignments.

  • FMI is another similarity measure of two cluster assignments. Besides the above , we use () to denote the number of element pairs that belong to the same clusters in () but in different clusters in (), respectively. Then, we calculate FMI as follows:

  • NMI calculates the normalized mutual information between two cluster assignments. This metric is widely used in literature and its calculation details can be found in [Nguyen, Epps, and Bailey2009].

Experimental Results

Clustering Performance. We first compare all methods in terms of the clustering performance. Results are shown in Table 2. We can see that overall SynSetMine outperforms baseline methods across all three datasets. The main disadvantage of unsupervised methods (Kmeans, Louvain, and SetExpan+Louvain) is that they cannot utilize supervision signals, which limits their performance when supervision is available. Compared with Kmeans, COP-Kmeans leverages additional supervision information from the training set and thus has an improvement in performance. However, the incorporation of supervision signals is not always straightforward. We find that the SVM+Louvain method fails to capture the synonymy relation, probably due to the limited expressive power of SVM. Also, the supervised clustering method L2C also does not work very well on synonym datasets regarding both efficiency and the quality of returned clusters. The reason is that during the training stage, L2C needs to calculate each example’s class distribution, and this computation effort is proportional to the training cluster number which is in the scale of thousands111111For comparison, L2C originally runs on MNIST and CIFAR-10 datasets where examples with the same class label is viewed as a cluster and the cluster number is of scale 10-100.. Another major deficiency of both SVM+Louvain and L2C is that their learned model is based on pairwise similarity and does not have a holistic view of set. On the other hand, SynSetMine encodes synonym sets directly and therefore is able to capture set-level features beyond pairwise similarity.

Set-instance Pair Prediction Performance. To further analyze the importance of encoding set structure, we compare our method with the approach that averages instance-instance pair prediction results, as shown in Figure 2. Specifically, given a trained set-instance classifier , we first construct an instance-instance classifier to determine whether two instances should be put into the same set. We let to be the mean of and . Then, for each set-instance pair , we apply the classifier to obtain all instance pair prediction results (i.e., ) and average them into the final prediction.

For evaluation, we convert the testing synonym sets in PubMed dataset into a collection of 3486 set-instance pairs, among which 1743 pairs are positive. Results are shown in Figure 4. We can see clearly that the set-encoding based approach performs much better than the “averaging pair prediction” approach in terms of both accuracy and F1 score. This demonstrates the importance of capturing entity relations among the set and modeling the set structure holistically.

Figure 4: Evaluation of set-instance classifiers on PubMed dataset using (a) Accuracy and (b) F1 score.

Efficiency Analysis.

We implement our model based on the PyTorch library, same as the L2C baseline. We train two neural models (

SynSetMine and L2C) on one Quadro P4000 GPU and run all the other methods on CPU. Results are shown in Table 3. Compared with the other supervised neural model L2C, our method can predict significantly faster. In addition, our set generation algorithm avoids explicitly computing all pairwise term similarities and thus is more efficient during the prediction stage.

Method Training Prediction
Wiki NYT PubMed Wiki NYT PubMed
Kmeans 1.82s 0.88s 2.95s
Louvain 3.94s 20.59s 74.6s
SetExpan+Louvain 323s 120s 4143s
COP-KMeans 249s 37.94s 713s
SVM+Louvain 4.9m 37s 1.3h 29.21s 5.80s 101.32s
L2C 16.8h 30.7m 120h 20.9m 56.6s
SynSetMine 48m 6.5m 7.5h 0.852s 0.348s 1.84s
Table 3: Efficiency analysis. Model and data loading time are excluded. GPU time are marked with . We use “–” to indicate that either there is no training time for unsupervised methods, or the model is too slow to train and thus we don’t get one trained model for prediction.

Model Analysis and Cases Studies

Following we conduct more experiments to analyze each component of our SynSetMine framework in more details and show some concrete case studies.

Effect of training set-instance pair generation strategy and negative sample size. In order to train the set-instance classifier, we need to first convert the training entity synonym sets (obtained from distant supervision) to a collection of set-instance pairs. We study how such conversion strategy and different negative sample sizes will affect the model performance on Wiki dataset. Results are shown in Figure 5(a). First, we find that the complete-random strategy actually performs better than the share-token strategy. One possible explanation is that the complete-random strategy can generate more diverse negative samples and thus provide more supervision signals. Second, we observe that by combining the above two strategies and generating more mixed negative samples, our model can be further improved, which again may contribute to the diversity of negative samples.

Effects of different set-scorer architectures. To demonstrate the necessity of our model components, we first compare the current set-scorer architecture (c.f. Figure 3) with its two ablations. Specifically, we leave either the Embedding Transformer (ET) or the Post Transformer (PT) out, and test the performance of remaining models. As shown in Table 4, both ET and PT are essential to our model, and removing either of them will significantly damage our model’s performance. Furthermore, the Post Transformer, which operates on the set representation, is particularly important to our model.

To further explore the effect of above two transformers, we train our model with different hidden layer sizes. Specifically, we use Both-X-Y to denote a set-scorer composed of an Embedding Transformer (of hidden sizes {50, X}) and a Post Transformer (of hidden sizes {X, Y, X}). Results are shown in Table 4. We discover that the performance first keeps increasing when the hidden layer size grows and then drops slightly. Also, the best model sizes are consistent for both Wiki and NYT datasets.

Method Wiki NYT
ARI FMI NMI ARI FMI NMI
No-ET 46.48 47.23 91.57 39.86 42.67 90.46
No-PT 1.50 0.50 89.95 0.82 1.70 82.20
Both-100-200 49.38 49.56 91.21 37.64 39.37 89.33
Both-150-300 53.06 53.27 91.96 43.20 44.08 89.57
Both-200-400 53.82 53.99 92.36 47.03 49.65 91.00
Both-250-500 57.34 58.13 93.10 48.89 51.33 91.19
Both-300-600 56.26 56.51 92.92 46.65 47.30 90.01
Both-350-600 55.93 56.10 92.69 47.40 48.37 90.14
Table 4: Analysis of set scorer architecture. “No-ET” means no Embedding Transformer (ET) module, “No-PT” means no Post Transformer (PT) module, and Both-X-Y stands for using both ET (of hidden sizes {50, X}) and PT (of hidden sizes {X, Y, X}).

Effect of probability threshold . As shown in Algorithm 1, our set generation algorithm requires an input probability threshold to determine whether a term should be added into one of the already detected sets. Intuitively, the higher this threshold is, the more conservative our set generation algorithm will be and more sets will be generated. In all the above experiments, we set the threshold to be 0.5. In this experiment, we intend to empirically study how this hyper-parameter will influence the clustering performance and how SynSetMine is sensitive to the choice of . Therefore, we run our set generation algorithm with fixed set-instance classifier and varied thresholds. The results are shown in Figure 5(b). First, we notice that the performance of our clustering algorithm is insensitive to and a value within 0.4 and 0.6 is generally good for . Second, we find that setting to be 0.5 is robust and works well across all three datasets.

(a) Negative sample size
(b) Probability threshold
Figure 5: Hyper-parameters analysis.

Case Studies. Table 5 presents some example outputs of SynSetMine. We can see that our method is able to detect different types of entity synonym sets across different domains. Then, in Table 6, we show a concrete case comparing our set-instance classifier with the approach that aggregates the instance-instance pair predictions. Clearly, the set encoding based method can detect more accurate entity synonyms.

Related Work

There are two lines of related work, including entity synonym discovery and set representation learning.

Entity Synonym Discovery. Most of previous efforts on entity synonym discovery focus on discovering entity synonyms from (semi-)structured data such as web table schemas [Cafarella et al.2008] and query logs [Chaudhuri, Ganti, and Xin2009, Ren and Cheng2015]. In this work, we aim to mine entity synonym sets directly from raw text corpus, which has a border application scope. Given a corpus, one can leverage co-occurrence statistics [Baroni and Bisi2004], textual pattern [Nakashole, Weikum, and Suchanek2012, Yahya et al.2014], distributional similarity [Pantel et al.2009, Wang et al.2015], or their combinations [Qu, Ren, and Han2017] to extract synonyms. These methods, however, only find synonymous term pairs or a rank list of query entity’s synonym, instead of entity synonym sets. Some work attempt to further cut-off the rank list into a set output [Ren and Cheng2015] or to build a synonym graph and then apply graph clustering techniques to derive synonym sets [Oliveira and Gomes2014, Ustalov, Panchenko, and Biemann2017]. However, these two-phase approaches suffer from the noise accumulation problem, and cannot directly model the entity set structure. Comparing to them, our approach can model entity synonym sets holistically and capture important relations among terms in the set. Finally, there exist some studies, such as finding lexical synonyms from dictionary [Ustalov et al.2017] or attribute synonyms from query logs [He et al.2016], but this work focuses on mining entity synonyms from raw text corpora.

Dataset Distant Supervision Discovered Synonym Sets
Wiki {“londres”, “london”} {“gas”, “gasoline”, “petrol”}
{“mushroom”, “toadstool”} {“roman fort”, “castra”}
NYT {“myanmar”, “burma”} { “royal dutch shell plc”,
royal dutch shell”, “shell” }
{“honda motor”, “honda”} {“chief executive officier”, “ceo” }
PubMed {“alzheimers disease”, {“dna microarrays”, “dna chip”,
Alzheimer’s dementia”} gene expression array”, “dna array”}
Table 5: Example outputs on three datasets.

Set Representation Learning. Our work is also related to set representation learning, which aims to construct permutation invariant representations of sets. PointNet [Qi et al.2017]

models points in a space as sets and uses multi-layer perceptrons and max pooling function to learn their representation. DeepSets

[Zaheer et al.2017] establishes the general form of permutation invariant functions and proposes to learn these functions using a deep neural network consisting of multi-layer perceptrons and aggregation functions. AutoEncSets [Hartford et al.2018] further extends DeepSets by modeling the interactions across sets. In this work, in addition to just representing given sets, we go beyond one step and aim to predict sets (i.e., entity synonym sets) from the vocabulary.

Method Set-instance Classifier Aggregate Pair Predictions
Synonym set {“u.k.”, “britain”} {“u.k.”, “britain”}
Ranked terms uk uk
united kingdom indie
great britain united kingdom
elizabeth ii america
Table 6: Comparison of set-instance classifier with the approach that aggregates instance-instance pair predictions.

Conclusions

In this paper, we study how to mine entity synonym sets from raw text corpus. We propose a framework named SynSetMine which effectively leverages distant supervision from knowledge bases to learn a set-instance classifier and integrates it in an efficient set generation algorithm to detect new entity synonym sets. Extensive experiments on three real-world datasets demonstrate both effectiveness and efficiency of our framework. In the future, we plan to further integrate set-instance classifier into the set generation algorithm and learn both of them in an end-to-end fashion.

Acknowledgements

This research is sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). Xiang Ren’s research has been supported in part by National Science Foundation SMA 18-29268. We thank Meng Qu for providing the original synonym discovery datasets and anonymous reviewers for valuable feedback.

References

  • [Anh, Kim, and Ng2015] Anh, T. L.; Kim, J.-j.; and Ng, S. K. 2015. Incorporating trustiness and collective synonym/contrastive evidence into taxonomy construction. In EMNLP.
  • [Baroni and Bisi2004] Baroni, M., and Bisi, S. 2004. Using cooccurrence statistics and the web to discover synonyms in a technical language. In LREC.
  • [Blondel et al.2008] Blondel, V. D.; Guillaume, J.-L.; Lambiotte, R.; and Lefebvre, E. 2008. Fast unfolding of communities in large networks.
  • [Cafarella et al.2008] Cafarella, M. J.; Halevy, A. Y.; Wang, D. Z.; Wu, E.; and Zhang, Y. 2008. Webtables: exploring the power of tables on the web. PVLDB 1:538–549.
  • [Chakrabarti et al.2012] Chakrabarti, K.; Chaudhuri, S.; Cheng, T.; and Xin, D. 2012. A framework for robust discovery of entity synonyms. In KDD.
  • [Chaudhuri, Ganti, and Xin2009] Chaudhuri, S.; Ganti, V.; and Xin, D. 2009. Exploiting web search to generate synonyms for entities. In WWW.
  • [Cheng, Lauw, and Paparizos2012] Cheng, T.; Lauw, H. W.; and Paparizos, S. 2012. Entity synonyms for structured web search. IEEE Transactions on Knowledge and Data Engineering 24:1862–1875.
  • [Finley and Joachims2005] Finley, T., and Joachims, T. 2005.

    Supervised clustering with support vector machines.

    In ICML.
  • [Hartford et al.2018] Hartford, J. S.; Graham, D. R.; Leyton-Brown, K.; and Ravanbakhsh, S. 2018. Deep models of interactions across sets. In ICML.
  • [He et al.2016] He, Y.; Chakrabarti, K.; Cheng, T.; and Tylenda, T. 2016. Automatic discovery of attribute synonyms using query logs and table corpora. In WWW.
  • [Hope and Keller2013] Hope, D. R., and Keller, B. 2013. Maxmax: A graph-based soft clustering algorithm applied to word sense induction. In CICLing.
  • [Hsu, Lv, and Kira2018] Hsu, Y.-C.; Lv, Z.; and Kira, Z. 2018. Learning to cluster in order to transfer across domains and tasks. In ICLR.
  • [Mendes et al.2011] Mendes, P. N.; Jakob, M.; García-Silva, A.; and Bizer, C. 2011. Dbpedia spotlight: shedding light on the web of documents. In I-SEMANTICS.
  • [Nakashole, Weikum, and Suchanek2012] Nakashole, N.; Weikum, G.; and Suchanek, F. M. 2012. Patty: A taxonomy of relational patterns with semantic types. In EMNLP-CoNLL.
  • [Nguyen, Epps, and Bailey2009] Nguyen, X. V.; Epps, J.; and Bailey, J. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In ICML.
  • [Oliveira and Gomes2014] Oliveira, H. G., and Gomes, P. 2014. Eco and onto.pt: a flexible approach for creating a portuguese wordnet automatically. Language Resources and Evaluation 48:373–393.
  • [Pantel et al.2009] Pantel, P.; Crestan, E.; Borkovsky, A.; Popescu, A.-M.; and Vyas, V. 2009. Web-scale distributional similarity and entity set expansion. In EMNLP.
  • [Qi et al.2017] Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017.

    Pointnet: Deep learning on point sets for 3d classification and segmentation.

    In CVPR.
  • [Qu, Ren, and Han2017] Qu, M.; Ren, X.; and Han, J. 2017. Automatic synonym discovery with knowledge bases. In KDD.
  • [Ren and Cheng2015] Ren, X., and Cheng, T. 2015. Synonym discovery for structured entities on heterogeneous graphs. In WWW.
  • [Shen et al.2017] Shen, J.; Wu, Z.; Lei, D.; Shang, J.; Ren, X.; and Han, J. 2017.

    Setexpan: Corpus-based set expansion via context feature selection and rank ensemble.

    In ECML/PKDD.
  • [Shen et al.2018] Shen, J.; Xiao, J.; He, X.; Shang, J.; Sinha, S.; and Han, J. 2018. Entity set search of scientific literature: An unsupervised ranking approach. In SIGIR.
  • [Shwartz and Dagan2016] Shwartz, V., and Dagan, I. 2016. Cogalex-v shared task: Lexnet - integrated path-based and distributional method for the identification of semantic relations. In CogALex@COLING.
  • [Turney2001] Turney, P. D. 2001. Mining the web for synonyms: Pmi-ir versus lsa on toefl. In ECML.
  • [Ustalov et al.2017] Ustalov, D.; Chernoskutov, M.; Biemann, C.; and Panchenko, A. 2017. Fighting with the sparsity of synonymy dictionaries for automatic synset induction. In AIST.
  • [Ustalov, Panchenko, and Biemann2017] Ustalov, D.; Panchenko, A.; and Biemann, C. 2017. Watset: Automatic induction of synsets from a graph of synonyms. In ACL.
  • [Wagstaff et al.2001] Wagstaff, K.; Cardie, C.; Rogers, S.; and Schrödl, S. 2001.

    Constrained k-means clustering with background knowledge.

    In ICML.
  • [Wang and Hirst2012] Wang, T., and Hirst, G. 2012. Exploring patterns in dictionary definitions for synonym extraction. Natural Language Engineering 18:313–342.
  • [Wang et al.2015] Wang, H.; Gao, B.; Bian, J.; Tian, F.; and Liu, T.-Y. 2015. Solving verbal comprehension questions in iq test by knowledge-powered word embedding. CoRR abs/1505.07909.
  • [Yahya et al.2014] Yahya, M.; Whang, S. E.; Gupta, R.; and Halevy, A. Y. 2014. Renoun: Fact extraction for nominal attributes. In EMNLP.
  • [Zaheer et al.2017] Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Póczos, B.; Salakhutdinov, R.; and Smola, A. J. 2017. Deep sets. In NIPS.
  • [Zhou et al.2013] Zhou, G.; Liu, Y.; Liu, F.; Zeng, D.; and Zhao, J. 2013. Improving question retrieval in community question answering using world knowledge. In IJCAI.