1 Introduction
Author contributions: Hao Zhu designed the research; Weize Chen prepared the data, and organized data annotation; Hao Zhu and Xu Han designed the experiments; Weize Chen performed the experiments; Hao Zhu, Weize Chen and Xu Han wrote the paper; Zhiyuan Liu and Maosong Sun proofread the paper. Zhiyuan Liu is the corresponding author.
Relations (sometimes also called properties), representing various types of connections between entities or arguments, are the core of expressing relational facts in most general knowledge bases (KBs) Suchanek et al. (2007); Bollacker et al. (2008). Hence, identifying relations is a crucial problem for several information extraction tasks. Although considerable effort has been devoted to these tasks, some nuances between similar relations are still overlooked (Table 1 shows an example); on the other hand, some distinct surface forms carrying the same relational semantics are mistaken for different relations. These severe problems motivate us to quantify the similarity between relations with a more effective and robust method.
|Sentence||The crisis didn’t influence his two daughters OBJ and SUBJ.|
In this paper, we introduce an adaptive and general framework for measuring the similarity of pairs of relations. Suppose that for each relation $r$ we have obtained a conditional distribution $P(h, t \mid r)$ ($h$ and $t$ are head and tail entities, and $r$ is a relation) over all head-tail entity pairs given $r$. We could quantify the similarity between a pair of relations by the divergence between the conditional probability distributions given these relations. In this paper, this conditional probability is given by a simple feed-forward neural network, which can capture the dependencies between entities conditioned on specific relations. Despite its simplicity, the proposed network is expected to cover various facts, even facts not used for training, owing to the good generalizability of neural networks. For example, our network will assign a fact a higher probability if it is “logical”: e.g., the network might prefer that an athlete has the same nationality as his/her national team rather than that of other nations.
Intuitively, two similar relations should have similar conditional distributions $P(h, t \mid r)$ over head-tail entity pairs, e.g., the entity pairs associated with be trade to and play for are most likely to be athletes and their clubs, whereas those associated with live in are often people and locations. In this paper, we evaluate the similarity between relations based on their conditional distributions over entity pairs. Specifically, we adopt Kullback–Leibler (KL) divergence in both directions as the metric. However, computing exact KL requires iterating over the whole entity pair space, which is intractable. Therefore, we further provide a sampling-based method to approximate the similarity score over the entity pair space for computational efficiency.
Besides developing a framework for assessing the similarity between relations, our second contribution is that we have done a survey of applications. We present experiments and analysis aimed at answering five questions:
(1) How well does the computed similarity score correlate with human judgment about the similarity between relations? How does our approach compare to other possible approaches based on other kinds of relation embeddings to define a similarity? (§ 3.4 and § 5)
(2) Open IE models inevitably extract many redundant relations. How can our approach help reduce such redundancy? (§ 6)
(3) To what extent, quantitatively, do the best relational classification models make errors among similar relations? (§ 7)
(4) Could similarity be used in a heuristic method to enhance negative sampling for relation prediction? (§ 8)
(5) Could similarity be used as an adaptive margin in the softmax-margin training method for relation extraction? (§ 9)
Finally, we conclude with a discussion of valid extensions to our method and other possible applications.
2 Learning Head-Tail Distribution
As introduced in § 1, we quantify the similarity between relations by their corresponding head-tail entity pair distributions. Consider the typical case in which we have a number of facts, but they are still sparse among all facts in the real world. How could we obtain a well-generalized distribution over the whole space of possible triples beyond the training facts? This section proposes a method to parameterize such a distribution.
2.1 Formal Definition of Fact Distribution
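A fact is a triple $(h, r, t) \in \mathcal{E} \times \mathcal{R} \times \mathcal{E}$, where $h$ and $t$ are called head and tail entities, $r$ is the relation connecting them, and $\mathcal{E}$ and $\mathcal{R}$ are the sets of entities and relations respectively. We consider a score function $F: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$ that maps all triples to a scalar value. As a special case, the function can be factorized into the sum of two parts: $F(h, r, t; \theta) \triangleq F_1(h, r; \theta) + F_2(h, r, t; \theta)$. We use $F$ to define the unnormalized probability
$$\tilde{P}_\theta(h, r, t) \triangleq \exp F(h, r, t; \theta)$$
for every triple $(h, r, t)$. The real parameter $\theta$ can be adjusted to obtain different distributions over facts.
In this paper, we only consider the locally normalized version of $\tilde{P}_\theta(h, r, t)$:
$$\tilde{P}_\theta(h, r, t) = \frac{\exp F_1(h, r; \theta)}{\sum_{h' \in \mathcal{E}} \exp F_1(h', r; \theta)} \cdot \frac{\exp F_2(h, r, t; \theta)}{\sum_{t' \in \mathcal{E}} \exp F_2(h, r, t'; \theta)}, \tag{2}$$
where $F_1$ and $F_2$ are directly parameterized by feed-forward neural networks. Through local normalization, $\tilde{P}_\theta(h, r, t)$ is naturally a valid probability distribution over entity pairs for each relation $r$, as the partition function $\sum_{h, t} \tilde{P}_\theta(h, r, t) = 1$. Therefore, $P_\theta(h, t \mid r) = \tilde{P}_\theta(h, r, t)$.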
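The local normalization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_f1` and `toy_f2` are hypothetical score functions standing in for the two feed-forward networks, and the three-entity vocabulary is made up. The sketch composes a softmax over heads with a softmax over tails, so the result is a distribution over entity pairs for a fixed relation.

```python
import math

def softmax(scores):
    """Normalize a list of real-valued scores into probabilities."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pair_distribution(f1, f2, r, entities):
    """Return P(h, t | r) as a dict keyed by (h, t)."""
    p_head = softmax([f1(h, r) for h in entities])          # P(h | r)
    dist = {}
    for h, ph in zip(entities, p_head):
        p_tail = softmax([f2(h, r, t) for t in entities])   # P(t | h, r)
        for t, pt in zip(entities, p_tail):
            dist[(h, t)] = ph * pt
    return dist

# Hypothetical toy scores standing in for the two feed-forward networks.
def toy_f1(h, r):
    return {"alice": 1.0, "bob": 0.0, "club": -2.0}[h]

def toy_f2(h, r, t):
    return 2.0 if t == "club" and h != "club" else 0.0

entities = ["alice", "bob", "club"]
dist = pair_distribution(toy_f1, toy_f2, "plays_for", entities)
total_mass = sum(dist.values())
```

Because each factor is normalized on its own, no sum over the full space of entity pairs is needed to evaluate the probability of a single pair.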
2.2 Neural architecture design
Here we introduce our special design of neural networks. For the first part and the second part, we implement the scoring functions introduced in equation 2 as
represents a multi-layer perceptron composed of layers like, , are embeddings of , , and includes weights and biases in all layers.
Now we discuss the method to perform training. In this paper, we consider joint training. By minimizing the loss function, we compute the model parameters:
Here the training data is a set of triples; in our applications, this set could be a knowledge base, the set of triples in a training set, etc. We train the whole set of parameters with the Adam optimizer Kingma and Ba (2014). Training details are shown in Appendix C.
|Relation Representation||Method||Similarity Quantification|
|Vectors||TransE Bordes et al. (2013)|
|Vectors||DistMult Yang et al. (2015)|
|Matrices||RESCAL Nickel et al. (2011)|
|Angles||RotatE Sun et al. (2019)|
|Probability Distribution||Ours||equation 6|
3 Quantifying Similarity
So far, we have discussed how to use neural networks to approximate the natural distribution of facts. The central topic of our paper, quantifying similarity, will be discussed in detail in this section.
3.1 Relations as Distributions
In this paper, we provide a probabilistic view of relations by representing each relation $r$ as a probability distribution $P(h, t \mid r)$. After training the neural network on a given set of triples, the model is expected to generalize well on the whole space.
Note that it is very easy to calculate $P(h, t \mid r)$ in our model thanks to local normalization (equation 2). Therefore, we can compute it by
$$P(h, t \mid r) = P(h \mid r)\, P(t \mid h, r).$$
3.2 Defining Similarity
As the basis of our definition, we hypothesize that the similarity between the distributions $P(h, t \mid r)$ of two relations reflects the similarity between the relations (§ 5 provides empirical results to corroborate this hypothesis). For example, if the conditional distributions of two relations put mass on similar entity pairs, the two relations should be quite similar. If they emphasize different ones, the two should have some differences in meaning.
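Formally, we define the similarity between two relations as a function of the divergence between the distributions of corresponding head-tail entity pairs:
$$S(r_1, r_2) = g\Big( D_{\mathrm{KL}}\big(P(h, t \mid r_1) \,\|\, P(h, t \mid r_2)\big),\ D_{\mathrm{KL}}\big(P(h, t \mid r_2) \,\|\, P(h, t \mid r_1)\big) \Big), \tag{6}$$
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes Kullback–Leibler divergence,
$$D_{\mathrm{KL}}\big(P(h, t \mid r_1) \,\|\, P(h, t \mid r_2)\big) = \mathbb{E}_{(h, t) \sim P(h, t \mid r_1)} \log \frac{P(h, t \mid r_1)}{P(h, t \mid r_2)},$$
and vice versa, and the function $g(\cdot, \cdot)$ is a symmetric function. To keep our definition coherent with the semantic meaning of “similarity”, $g$ should be monotonically decreasing. Throughout this paper, we use an exponential function (we view KL divergences as energy functions) composed with the max function, i.e., $g(x, y) = e^{-\max(x, y)}$. Note that by taking both directions of KL divergence into account, our definition incorporates the entity pairs with high probability under both $r_1$ and $r_2$. Intuitively, if $P(h, t \mid r_1)$ mainly distributes on a portion of the entity pairs that $P(h, t \mid r_2)$ emphasizes, $r_1$ is only a hyponym of $r_2$. Considering both directions of KL divergence helps the measure account for such asymmetric cases. We will discuss the advantage of this method in detail in § 3.4.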
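As a small worked example of this definition, the sketch below computes both KL directions exactly on a four-pair toy space and maps them to a similarity with an exponential of the negated maximum. The entity pairs and probability values are invented for illustration, and the choice $g(x, y) = e^{-\max(x, y)}$ is an assumption read off the description above (a symmetric, decreasing exponential composed with max).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for distributions given as dicts over the same entity pairs."""
    return sum(pv * math.log(pv / q[pair]) for pair, pv in p.items() if pv > 0)

def similarity(p1, p2):
    """exp(-max(KL(p1 || p2), KL(p2 || p1))); equals 1 for identical inputs."""
    return math.exp(-max(kl_divergence(p1, p2), kl_divergence(p2, p1)))

# Made-up conditional distributions P(h, t | r) over four entity pairs.
pairs = [("athlete_1", "club_1"), ("athlete_2", "club_2"),
         ("athlete_1", "city_1"), ("athlete_2", "city_2")]
play_for = dict(zip(pairs, [0.45, 0.45, 0.05, 0.05]))
traded_to = dict(zip(pairs, [0.40, 0.50, 0.05, 0.05]))
live_in = dict(zip(pairs, [0.05, 0.05, 0.45, 0.45]))

sim_close = similarity(play_for, traded_to)  # both put mass on athlete-club pairs
sim_far = similarity(play_for, live_in)      # mass on different kinds of pairs
```

Taking the maximum of the two directions means a pair of relations is scored as similar only when neither distribution places mass where the other places almost none.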
3.3 Calculating Similarity
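As mentioned in § 1, it is intractable to compute the similarity exactly, as it involves $O(|\mathcal{E}|^2)$ computation for an entity set $\mathcal{E}$. Hence, we consider the Monte Carlo approximation:
$$D_{\mathrm{KL}}\big(P(h, t \mid r_1) \,\|\, P(h, t \mid r_2)\big) \approx \frac{1}{|\mathcal{S}|} \sum_{(h, t) \in \mathcal{S}} \log \frac{P(h, t \mid r_1)}{P(h, t \mid r_2)},$$
where $\mathcal{S}$ is a list of entity pairs sampled from $P(h, t \mid r_1)$. We use sequential sampling to obtain $\mathcal{S}$, which means we first sample $h$ given $r$ from $P(h \mid r)$, and then sample $t$ given $h$ and $r$ from $P(t \mid h, r)$. (Sampling $h$ and $t$ at the same time requires $O(|\mathcal{E}|^2)$ computation, while sequential sampling requires only $O(|\mathcal{E}|)$ computation.) This seems to be an asymmetric procedure, and sampling from a mixture of the forward and backward directions should yield a better result; surprisingly, in practice, sampling from a single direction works just as well as sampling from both directions.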
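The sampling-based approximation can be illustrated with a short sketch. It assumes each relation is described by two lookup tables, one for $P(h \mid r)$ and one for $P(t \mid h, r)$; these tables, the two-entity vocabulary, and the helper names are all hypothetical stand-ins for the trained networks. The exact KL is computed by enumeration only to check the estimate on this tiny space.

```python
import math
import random

def sample_pair(p_head, p_tail, rng):
    """Sequential sampling: h ~ P(h | r), then t ~ P(t | h, r)."""
    heads = list(p_head)
    h = rng.choices(heads, weights=[p_head[x] for x in heads])[0]
    tails = list(p_tail[h])
    t = rng.choices(tails, weights=[p_tail[h][x] for x in tails])[0]
    return h, t

def log_prob(p_head, p_tail, h, t):
    """log P(h, t | r) = log P(h | r) + log P(t | h, r)."""
    return math.log(p_head[h]) + math.log(p_tail[h][t])

def mc_kl(model_1, model_2, n_samples, rng):
    """Monte Carlo estimate of KL(P(. | r1) || P(. | r2))."""
    total = 0.0
    for _ in range(n_samples):
        h, t = sample_pair(model_1[0], model_1[1], rng)
        total += log_prob(*model_1, h, t) - log_prob(*model_2, h, t)
    return total / n_samples

def exact_kl(model_1, model_2):
    """Exact KL by enumerating every pair; only feasible for tiny toy spaces."""
    total = 0.0
    for h in model_1[0]:
        for t in model_1[1][h]:
            lp1 = log_prob(*model_1, h, t)
            total += math.exp(lp1) * (lp1 - log_prob(*model_2, h, t))
    return total

# Toy conditional tables (made up): (P(h | r), P(t | h, r)) per relation.
r1 = ({"a": 0.7, "b": 0.3}, {"a": {"a": 0.2, "b": 0.8}, "b": {"a": 0.5, "b": 0.5}})
r2 = ({"a": 0.5, "b": 0.5}, {"a": {"a": 0.4, "b": 0.6}, "b": {"a": 0.5, "b": 0.5}})

estimate = mc_kl(r1, r2, 20000, random.Random(0))
exact = exact_kl(r1, r2)
```

Each sample touches one row of the head table and one row of the tail table, which is where the linear cost of sequential sampling comes from.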
3.4 Relationship with other metrics
Previous work proposed various methods for representing relations as vectors Bordes et al. (2013); Yang et al. (2015), as matrices Nickel et al. (2011), even as angles Sun et al. (2019), etc. Based on each of these representations, one could easily define various similarity quantification methods. (Taking the widely used vector representations as an example, we can define the similarity between relations based on cosine distance, dot product distance, L1/L2 distance, etc.) We show in Table 2 the best one of them in each category of relation representation.
Here we provide two intuitive reasons for using our proposed probability-based similarity: (1) the capacity of a single fixed-size representation is limited, so some details about the fact distribution are lost during embedding; (2) directly comparing distributions yields better interpretability: given two relation embeddings, one cannot tell how the two relations differ, whereas our model makes it possible to study the detailed differences between probabilities on every entity pair. footnote 8 provides an example. Although the two relations talk about the same topic, they have different meanings. TransE embeds them as the vectors closest to each other, while our model can capture the distinction between the distributions corresponding to the two relations, which can be directly noticed from the figure.
4 Dataset Construction
We show the statistics of the datasets we use in Table 3 and introduce the construction procedures in this section.
4.1 Wikidata
In Wikidata Vrandečić and Krötzsch (2014), facts can be described as (Head item/property, Property, Tail item/property). To construct a dataset suitable for our task, we only consider the facts whose head entity and tail entity are both items. We first choose the most common 202 relations and 120,000 entities from Wikidata as our initial data. Considering that the facts containing the two most frequently appearing relations (P2860: cites, and P31: instance of) occupy half of the initial data, we drop the two relations to downsize the dataset and make the dataset more balanced. Finally, we keep the triples whose head and tail both come from the selected 120,000 entities and whose relation comes from the remaining 200 relations.
4.2 ReVerb Extractions
ReVerb Fader et al. (2011) is a program that automatically identifies and extracts binary relationships from English sentences. We use the extractions from running ReVerb on Wikipedia (http://reverb.cs.washington.edu/). We only keep the relations that appear more than 10 times and their corresponding triples to construct our dataset.
4.3 FB15K and TACRED
5 Human Judgments
Following Miller and Charles (1991); Resnik (1999) and the vast amount of previous work on semantic similarity, we ask nine undergraduate subjects to assess the similarity of 360 pairs of relations from a subset of Wikidata Vrandečić and Krötzsch (2014) that are chosen to cover from high to low levels of similarity. (Wikidata provides detailed descriptions of properties (relations), which could help subjects understand the relations better.) In our experiment, subjects were asked to rate each pair with an integer similarity score from 0 (no similarity) to 4 (perfectly the same); the detailed instruction is attached in Appendix F. The inter-subject correlation, estimated by the leave-one-out method Weiss and Kulikowski (1991), is r = , standard deviation = . This important reference value (marked in Figure 2) could be seen as the highest expected performance for machines Resnik (1999).
To get baselines for comparison, we consider other possible methods to define similarity functions, as shown in Table 2. We compute the correlation between these methods and human judgment scores. As the models we have chosen are the ones that work best in knowledge base completion, we do expect the similarity quantification approaches based on them to capture some degree of similarity. As shown in Figure 2, the three baseline models achieve moderate () positive correlation. On the other hand, our model shows a stronger correlation () with human judgment, indicating that considering the probability over the whole entity pair space helps to obtain a similarity closer to human judgments. These results provide evidence for our claim raised in § 3.2.
6 Redundant Relation Removal
Open IE extracts concise token patterns from plain text to represent various relations between entities, e.g., (Mark Twain, was born in, Florida). As Open IE is significant for constructing KBs, many effective extractors have been proposed to extract triples, such as Text-Runner Yates et al. (2007), ReVerb Fader et al. (2011), and Stanford Open IE Angeli et al. (2015). However, these extractors only yield relation patterns between entities, without aggregating and clustering their results. Accordingly, a fair amount of redundant relation patterns remains after extraction. Furthermore, the redundant patterns lead to some redundant relations in KBs.
Recently, some efforts have been devoted to Open Relation Extraction (Open RE) Lin and Pantel (2001); Yao et al. (2011); Marcheggiani and Titov (2016); ElSahar et al. (2017), aiming to cluster relation patterns into several relation types instead of redundant relation patterns. However, on the one hand, these Open RE methods adopt distantly supervised labels as golden relation types, and thus suffer from both false positive and false negative problems. On the other hand, these methods still rely on the conventional similarity metrics mentioned above.
In this section, we will show that our defined similarity quantification could help Open IE by identifying redundant relations. To be specific, we set up a toy experiment to remove redundant relations in KBs for a preliminary comparison (§ 6.1). Then, we evaluate our model and baselines on the real-world dataset extracted by Open IE methods (§ 6.2). Considering that the existing evaluation metrics for Open IE and Open RE rely on either labor-intensive annotations or distantly supervised annotations, we propose a metric approximating recall and precision evaluation based on operable human annotations, balancing efficiency and accuracy.
|Dataset||#Relations||#Entities||#Facts||Section|
|Wikidata||188||112,946||426,067||§ 5 and § 6.1|
|ReVerb Extractions||3,736||194,556||266,645||§ 6.2|
|FB15K||1,345||14,951||483,142||§ 7.1 and § 8|
|TACRED||42||29,943||68,124||§ 7.2 and § 9|
6.1 Toy Experiment
In this subsection, we propose a toy environment to verify our similarity-based method. Specifically, we construct a dataset from Wikidata (the construction procedure is shown in § 4.1) and implement the Chinese restaurant process (described in Appendix B) to split every relation in the dataset into several sub-relations. Then, we filter out those sub-relations appearing fewer than 50 times to eventually get 1165 relations. All these split relations are regarded as different ones during training, and then different relation similarity metrics are adopted to merge those sub-relations into one relation. As Figure 2 shows that the matrix-based approach is less effective than other approaches, we leave this approach out of this experiment. The results are shown in Table 4.
6.2 Real World Experiment
In this subsection, we evaluate various relation similarity metrics on the real-world Open IE patterns. The dataset is constructed by ReVerb. Different patterns are regarded as different relations during training, and we also adopt various relation similarity metrics to merge similar relation patterns. It is nearly impossible to annotate all pattern pairs as to whether they should be merged, and it is also inappropriate to take distantly supervised annotations as golden results. Hence, we propose a novel metric approximating recall and precision evaluation based on minimal human annotations for evaluation in this experiment.
Approximating Recall and Precision
Recall is defined as the fraction of true positive instances over the total number of real positive instances (often called relevant in the information retrieval field). However, we do not have annotations about which pairs of relations are synonymous. Crowdsourcing is a method to obtain a large number of high-quality annotations. Nevertheless, applying crowdsourcing is not trivial in our settings, because it is intractable to enumerate all synonymous pairs in the large space of relation (pattern) pairs in Open IE. A promising method is to use rejection sampling by uniform sampling from the whole space, and only keep the synonymous ones judged by crowdworkers. However, this is not practical either, as the synonymous pairs are sparse in the whole space, resulting in low efficiency. Fortunately, we could use normalized importance sampling as an alternative to get an unbiased estimation of recall.
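Theorem 1 (see proof in Appendix A).
Suppose every sample $x$ has a label $f(x) \in \{0, 1\}$, and the model to be evaluated also gives its prediction $\hat{f}(x) \in \{0, 1\}$. The recall can be written as
$$\mathrm{Recall} = \mathbb{E}_{x \sim U_{f(x)=1}} \mathbb{1}[\hat{f}(x) = 1], \tag{9}$$
where $U_{f(x)=1}$ is the uniform distribution over all samples with $f(x) = 1$. If we have a proposal distribution $q(x)$ satisfying $q(x) > 0$ for every $x$ with $f(x) = 1$, we get an unbiased estimation of recall:
$$\mathrm{Recall} \approx \sum_{i=1}^{n} \hat{w}_i \, \mathbb{1}[\hat{f}(x_i) = 1], \tag{10}$$
where $\hat{w}_i = w_i / \sum_{j=1}^{n} w_j$ is a normalized version of $w_i = \mathbb{1}[f(x_i) = 1] / \tilde{q}(x_i)$, where $\tilde{q}$ is the unnormalized version of $q$, and $x_1, \dots, x_n$ are i.i.d. drawn from $q(x)$.
Similar to equation 9, we can write the expectation form of precision:
$$\mathrm{Precision} = \mathbb{E}_{x \sim U_{\hat{f}(x)=1}} \mathbb{1}[f(x) = 1],$$
where $U_{\hat{f}(x)=1}$ is the uniform distribution over all samples with $\hat{f}(x) = 1$. As these samples can be found by running the model on the data, we can simply approximate precision by Monte Carlo sampling:
$$\mathrm{Precision} \approx \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[f(x_i) = 1], \quad x_i \sim U_{\hat{f}(x)=1}. \tag{12}$$
In our setting, $x = (r_1, r_2)$ is a pair of relations, $f(x) = 1$ means $r_1$ and $r_2$ are the same relation, and $\hat{f}(x) = 1$ means $S(r_1, r_2)$ is larger than a given threshold.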
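The two estimators can be sketched on a synthetic population. Everything below is an assumption made for illustration: the 1000 integer "items", the labeling and prediction rules, and the proposal weights `q_tilde` are invented, and the recall estimator is written as a standard self-normalized importance-sampling estimate, which may differ in detail from the form proved in Appendix A.

```python
import random

def estimate_recall(samples, q_tilde, label, predict):
    """Self-normalized importance-sampling estimate of recall.

    samples: items drawn i.i.d. from the proposal distribution q
    q_tilde: function giving the unnormalized proposal weight of an item
    label, predict: functions returning 1 (positive) or 0 (negative)
    """
    weights = [label(x) / q_tilde(x) for x in samples]
    norm = sum(weights)
    if norm == 0:
        raise ValueError("no truly positive item was sampled")
    return sum(w * predict(x) for w, x in zip(weights, samples)) / norm

def estimate_precision(samples, label):
    """Plain Monte Carlo estimate; samples are uniform over predicted positives."""
    return sum(label(x) for x in samples) / len(samples)

# Synthetic population: 50 true positives, 40 predicted positives, 30 in common.
population = list(range(1000))

def label(x):
    return 1 if x < 50 else 0

def predict(x):
    return 1 if x < 30 or 50 <= x < 60 else 0

def q_tilde(x):
    # Proposal concentrates on items that look similar (hypothetical scores).
    return 10.0 if x < 30 else (5.0 if x < 100 else 1.0)

rng = random.Random(0)
drawn = rng.choices(population, weights=[q_tilde(x) for x in population], k=20000)
recall_hat = estimate_recall(drawn, q_tilde, label, predict)   # true recall is 30/50

predicted_positive = [x for x in population if predict(x)]
checked = rng.choices(predicted_positive, k=2000)
precision_hat = estimate_precision(checked, label)             # true precision is 30/40
```

In an annotation setting, `label` would be answered by human judges only for the drawn items, which is what keeps the annotation budget small.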
The results on the ReVerb Extractions dataset that we constructed are shown in Figure 3. To approximate recall, we use the similarity scores as the proposal distribution $q$. 500 relation pairs are then drawn from $q$. To approximate precision, we set thresholds at equal intervals. At each threshold, we uniformly sample 50 to 100 relation pairs whose similarity score given by the model is larger than the threshold. We ask 15 undergraduates to judge whether two relations in a relation pair have the same meaning. A relation pair is viewed as valid only if 8 of the annotators annotate it as valid. We use the annotations to approximate recall and precision with equation 10 and equation 12. Apart from the confidence interval of precision shown in the figure, the largest confidence interval among thresholds for recall is shown in Figure 6. From the result, we can see that our model outperforms the other similarity measures by a very large margin.
7 Error Analysis for Relational Classification
In this section, we consider two kinds of relational classification tasks: (1) relation prediction and (2) relation extraction. Relation prediction aims at predicting the relationship between entities with a given set of triples as training data; while relation extraction aims at extracting the relationship between two entities in a sentence.
7.1 Relation Prediction
We hope to design a simple and clear experimental setup to conduct error analysis for relation prediction. Therefore, we consider a typical method, TransE Bordes et al. (2013), as the subject and FB15K Bordes et al. (2013) as the dataset. TransE embeds entities and relations as vectors, and trains these embeddings by minimizing
$$\mathcal{L} = \sum_{(h, r, t) \in \mathcal{D}} \big[ d(\mathbf{h} + \mathbf{r}, \mathbf{t}) - d(\mathbf{h}' + \mathbf{r}', \mathbf{t}') + \gamma \big]_+,$$
where $\mathcal{D}$ is the set of training triples, $d(\cdot, \cdot)$ is the distance function, $(h', r', t')$ is a negative sample with one element different from $(h, r, t)$, uniformly sampled from $\mathcal{E} \times \mathcal{R} \times \mathcal{E}$, and $\gamma$ is the margin. (Note that only head and tail entities are changed in the original TransE when doing link prediction, but changing $r$ results in better performance when doing relation prediction.)
During testing, for each entity pair $(h, t)$, TransE ranks relations according to $d(\mathbf{h} + \mathbf{r}, \mathbf{t})$. For each $(h, r, t)$ in the test set, we call the relations ranked higher than $r$ distracting relations. We then compare the similarity between the golden relation and the distracting relations. Note that some entity pairs correspond to more than one relation; we do not count these other valid relations as distracting relations.
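The notion of distracting relations can be made concrete with a small sketch. The two-dimensional embeddings, the relation names, and the use of L1 distance are assumptions chosen for illustration; the experiment itself uses trained TransE embeddings on FB15K.

```python
def l1_distance(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def rank_relations(h_vec, t_vec, rel_vecs):
    """Sort relation names by d(h + r, t), smallest distance first."""
    def score(name):
        translated = [a + b for a, b in zip(h_vec, rel_vecs[name])]
        return l1_distance(translated, t_vec)
    return sorted(rel_vecs, key=score)

def distracting_relations(h_vec, t_vec, gold, rel_vecs, also_valid=()):
    """Relations ranked above the gold one, skipping other valid relations."""
    ranking = rank_relations(h_vec, t_vec, rel_vecs)
    above_gold = ranking[:ranking.index(gold)]
    return [r for r in above_gold if r not in also_valid]

# Hypothetical 2-d embeddings for one test triple (h, play_for, t).
h_vec, t_vec = [0.0, 0.0], [1.0, 1.0]
rel_vecs = {
    "play_for": [0.8, 0.8],
    "traded_to": [0.9, 1.0],
    "member_of": [1.0, 0.95],
    "live_in": [-1.0, 2.0],
}
ranking = rank_relations(h_vec, t_vec, rel_vecs)
distractors = distracting_relations(
    h_vec, t_vec, "play_for", rel_vecs, also_valid={"member_of"}
)
```

Here `member_of` is treated as another correct relation for the same entity pair, so it is excluded even though it ranks above the gold relation.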
7.2 Relation Extraction
For relation extraction, we consider the supervised relation extraction setting and the TACRED dataset Zhang et al. (2017). As the subject model, we use the best model on the TACRED dataset, the position-aware neural sequence model. This method first passes the sentence into an LSTM and then calculates an attention-weighted sum of the hidden states of the LSTM by taking positional features into account. This simple and effective method achieves the best performance on the TACRED dataset.
Figure 5 shows the distribution of similarity ranks of distracting relations in the outputs of the above-mentioned models on both relation prediction and relation extraction tasks. From Figures 3(b) and 3(a), we can observe that the most distracting relations are the most similar ones, which corroborates our hypothesis that even the best models on these tasks still make mistakes among the most similar relations. This result also highlights the importance of a heuristic method for guiding models to pay more attention to the boundary between similar relations. We also tried negative sampling with relation type constraints, but saw no improvement compared with uniform sampling. The details of negative sampling with relation type constraints are presented in Appendix E.
|SDP-LSTM Xu et al. (2015)||66.3||52.7||58.7|
|PA-LSTM Zhang et al. (2017)||65.7||64.5||65.1|
|Neural+Ours||PA-LSTM (Softmax-Margin Loss)||68.5||64.7||66.6|
8 Similarity and Negative Sampling
Based on the observation presented in § 7.3, we find out that similar relations are often confusing for relation prediction models. Therefore, corrupted triples with similar relations can be used as high-quality negative samples.
For a given valid triple $(h, r, t)$, we corrupt the triple by substituting $r$ with $r'$ with the probability
$$p = \frac{S(r, r')^{1/\alpha}}{\sum_{r'' \neq r} S(r, r'')^{1/\alpha}},$$
where $\alpha$ is the temperature of the exponential function: the bigger $\alpha$ is, the flatter the probability distribution is. When the temperature approaches infinity, the sampling process reduces to uniform sampling.
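A minimal sketch of temperature-controlled corruption follows. It assumes the sampling probability is proportional to the similarity raised to the power $1/\alpha$, which is one form consistent with the behavior described above (flatter for larger temperature, uniform in the limit); the similarity values and relation names are made up.

```python
import random

def corruption_probs(sim_to_gold, alpha):
    """Probabilities of replacing the gold relation with each other relation.

    sim_to_gold: dict mapping r' (r' != r) to the similarity S(r, r')
    alpha: temperature, must be positive
    """
    weights = {r: s ** (1.0 / alpha) for r, s in sim_to_gold.items()}
    norm = sum(weights.values())
    return {r: w / norm for r, w in weights.items()}

def sample_negative_relation(sim_to_gold, alpha, rng):
    """Draw one replacement relation according to corruption_probs."""
    probs = corruption_probs(sim_to_gold, alpha)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names])[0]

# Hypothetical similarities to the gold relation play_for.
sims = {"traded_to": 0.9, "member_of": 0.5, "live_in": 0.01}

flat = corruption_probs(sims, alpha=1000.0)   # high temperature: near uniform
sharp = corruption_probs(sims, alpha=0.1)     # low temperature: hardest negative
negative = sample_negative_relation(sims, 1.0, random.Random(0))
```

A training loop would call `sample_negative_relation` with a temperature that is lowered over time, so that early negatives are easy and later ones are confusable.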
In training, we set the initial temperature to a high level and gradually reduce the temperature. Intuitively, this enables the model to distinguish among obviously different relations in the early stage and provides more and more confusing negative triples as training proceeds, helping the model distinguish similar relations. This can also be viewed as a process of curriculum learning Bengio et al. (2009): the data fed to the model gradually changes from simple negative triples to hard ones.
We perform relation prediction task on FB15K with TransE. Following Bordes et al. (2013), we use the "Filtered" setting protocol, i.e., filtering out the corrupted triples that appear in the dataset. Our sampling method is shown to improve the model’s performance, especially on Hit@1 (Figure 5). Training details are described in Appendix C.
9 Similarity and Softmax-Margin Loss
Similar to § 8, we find that relation extraction models often make wrong predictions on similar relations. In this section, we use similarity as an adaptive margin in softmax-margin loss to improve the performance of relation extraction models.
As shown in Gimpel and Smith (2010), Softmax-Margin Loss can be expressed as
$$\mathcal{L} = \sum_{i=1}^{n} \Big[ -\theta^\top f(x_i, y_i) + \log \sum_{y \in \mathcal{Y}(x_i)} \exp\big( \theta^\top f(x_i, y) + \mathrm{cost}(y_i, y) \big) \Big],$$
where $\mathcal{Y}(x)$ denotes a structured output space for $x$, and $\langle x_i, y_i \rangle$ is the $i$-th example in the training data.
We can easily incorporate similarity into the cost function $\mathrm{cost}(y_i, y)$. In this task, we define the cost function as $\alpha S(y_i, y)$, where $\alpha$ is a hyperparameter.
Intuitively, we give a larger margin between similar relations, forcing the model to distinguish among them, and thus making the model perform better. We apply our method to Position-aware Attention LSTM (PA-LSTM) Zhang et al. (2017), and Table 5 shows that our method improves the performance of PA-LSTM. Training details are described in Appendix C.
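To illustrate how a similarity-based margin changes the loss, the sketch below computes a softmax-margin loss for a single example over three relations. It assumes the cost added to each wrong relation is the similarity to the gold relation scaled by a hyperparameter, and that the gold relation itself receives zero cost; the second point is an assumption, since the text does not spell it out. Scores and similarities are invented.

```python
import math

def softmax_margin_loss(scores, gold, sim_to_gold, alpha):
    """Softmax-margin loss for one example.

    scores: dict mapping each relation to the model score
    gold: the gold relation
    sim_to_gold: dict mapping each non-gold relation to its similarity to gold
    alpha: scale of the similarity-based margin (0 gives plain cross-entropy)
    """
    augmented = [
        s + (0.0 if r == gold else alpha * sim_to_gold[r])  # zero cost for gold (assumed)
        for r, s in scores.items()
    ]
    top = max(augmented)  # log-sum-exp with the max subtracted for stability
    log_z = top + math.log(sum(math.exp(a - top) for a in augmented))
    return log_z - scores[gold]

sims = {"traded_to": 0.9, "live_in": 0.01}  # hypothetical similarities to play_for

# Same score mass, placed either on the similar or on the dissimilar relation.
confused_similar = {"play_for": 2.0, "traded_to": 1.5, "live_in": 0.5}
confused_dissimilar = {"play_for": 2.0, "traded_to": 0.5, "live_in": 1.5}

plain_a = softmax_margin_loss(confused_similar, "play_for", sims, alpha=0.0)
plain_b = softmax_margin_loss(confused_dissimilar, "play_for", sims, alpha=0.0)
margin_a = softmax_margin_loss(confused_similar, "play_for", sims, alpha=1.0)
margin_b = softmax_margin_loss(confused_dissimilar, "play_for", sims, alpha=1.0)
```

Without the margin the two cases are penalized equally; with it, placing score on the similar relation costs more, which is the pressure toward separating similar relations described above.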
10 Related Works
In many early works devoted to psychology and linguistics, especially those exploring semantic similarity Miller and Charles (1991); Resnik (1999), researchers have empirically found that there are various categorizations of semantic relations among words and contexts. To promote research on these different semantic relations, Bejar et al. (1991) explicitly define these relations and Miller (1995) further systematically organizes rich semantic relations between words via a database. To identify correlations and distinctions between different semantic relations so as to support learning semantic similarity, various methods have attempted to measure relational similarity Turney (2005, 2006); Zhila et al. (2013); Pedersen (2012); Rink and Harabagiu (2012); Mikolov et al. (2013b, a).
With the ongoing development of information extraction and effective construction of KBs Suchanek et al. (2007); Bollacker et al. (2008); Bizer et al. (2009), relations are further defined as various types of latent connections between objects more than semantic relations. These general relations play a core role in expressing relational facts in the real world. Hence, there are accordingly various methods proposed for discovering more relations and their facts, including open information extraction Brin (1998); Agichtein and Gravano (2000); Ravichandran and Hovy (2002); Banko et al. (2007); Zhu et al. (2009); Etzioni et al. (2011); Saha et al. (2017) and relation extraction Riedel et al. (2013); Liu et al. (2013); Zeng et al. (2014); Santos et al. (2015); Zeng et al. (2015); Lin et al. (2016), and relation prediction Bordes et al. (2013); Wang et al. (2014); Lin et al. (2015b, a); Xie et al. (2016).
For both semantic relations and general relations, identifying them is a crucial problem, requiring systems to provide a fine-grained relation similarity metric. However, the existing methods suffer from sparse data, which makes it difficult to achieve an effective and stable similarity metric. Motivated by this, we propose to measure relation similarity by leveraging their fact distribution so that we can identify nuances between similar relations, and merge those distant surface forms of the same relations, benefitting the tasks mentioned above.
11 Conclusion and Future Work
In this paper, we introduce an effective method to quantify relation similarity and provide analysis and a survey of applications. We note that there is a wide range of future directions: (1) human prior knowledge could be incorporated into the similarity quantification; (2) similarity between relations could also be considered in multi-modal settings, e.g., extracting relations from images, videos, or even from audio; (3) by analyzing the distributions corresponding to different relations, one can also find some “meta-relations” between relations, such as hypernymy and hyponymy.
This work is supported by the National Natural Science Foundation of China (NSFC No. 61572273, 61532010), the National Key Research and Development Program of China (No. 2018YFB1004503). Chen and Zhu are supported by Tsinghua University Initiative Scientific Research Program, and Chen is also supported by DCST Student Academic Training Program. Han is also supported by 2018 Tencent Rhino-Bird Elite Training Program.
- Agichtein and Gravano (2000) Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of JCDL, pages 85–94.
- Angeli et al. (2015) Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of ACL, pages 344–354.
- Banko et al. (2007) Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of IJCAI, pages 2670–2676.
- Bejar et al. (1991) Isaac I Bejar, Roger Chaffin, and Susan E Embretson. 1991. Cognitive and psychometric analysis of analogical problem solving.
- Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of ICML, pages 41–48.
- Bizer et al. (2009) Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. Dbpedia-a crystallization point for the web of data. Web Semantics: science, services and agents on the world wide web, 7:154–165.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of SIGMOD, pages 1247–1250.
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of NIPS, pages 2787–2795.
- Brin (1998) Sergey Brin. 1998. Extracting patterns and relations from the world wide web. In Proceedings of WWW, pages 172–183.
- ElSahar et al. (2017) Hady ElSahar, Elena Demidova, Simon Gottschalk, Christophe Gravier, and Frederique Laforest. 2017. Unsupervised open relation extraction. In Proceedings of ESWC, pages 12–16.
- Etzioni et al. (2011) Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam Mausam. 2011. Open information extraction: the second generation. In Proceedings of IJCAI, pages 3–10.
- Fader et al. (2011) Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of EMNLP, pages 1535–1545.
- Gimpel and Smith (2010) Kevin Gimpel and Noah A Smith. 2010. Softmax-margin crfs: Training log-linear models with cost functions. In Proceedings of NAACL, pages 733–736.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lin and Pantel (2001) Dekang Lin and Patrick Pantel. 2001. DIRT - discovery of inference rules from text. In Proceedings of SIGKDD, pages 323–328.
- Lin et al. (2015a) Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. 2015a. Modeling relation paths for representation learning of knowledge bases. In Proceedings of EMNLP, pages 705–714.
- Lin et al. (2015b) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015b. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of AAAI, pages 2181–2187.
- Lin et al. (2016) Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of ACL, pages 2124–2133.
- Liu et al. (2013) ChunYang Liu, WenBo Sun, WenHan Chao, and Wanxiang Che. 2013. Convolution neural network for relation extraction. In Proceedings of ICDM, pages 231–242.
- Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. JMLR, 9:2579–2605.
- Marcheggiani and Titov (2016) Diego Marcheggiani and Ivan Titov. 2016. Discrete-state variational autoencoders for joint discovery and factorization of relations. TACL, 4:231–244.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of ICLR.
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.
- Miller (1995) George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38:39–41.
- Miller and Charles (1991) George A Miller and Walter G Charles. 1991. Contextual correlates of semantic similarity. Language and cognitive processes, 6:1–28.
- Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of ICML, pages 809–816.
- Owen (2013) Art B. Owen. 2013. Monte Carlo theory, methods and examples.
- Pedersen (2012) Ted Pedersen. 2012. Duluth: Measuring degrees of relational similarity with the gloss vector measure of semantic relatedness. In Proceedings of SemEval 2012, pages 497–501.
- Ravichandran and Hovy (2002) Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of ACL, pages 41–47.
- Resnik (1999) Philip Resnik. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of artificial intelligence research, 11:95–130.
- Riedel et al. (2013) Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL, pages 74–84.
- Rink and Harabagiu (2012) Bryan Rink and Sanda Harabagiu. 2012. UTD: Determining relational similarity using lexical patterns. In Proceedings of SemEval 2012, pages 413–418.
- Saha et al. (2017) Swarnadeep Saha, Harinder Pal, et al. 2017. Bootstrapping for numerical open IE. In Proceedings of ACL, volume 2, pages 317–323.
- Santos et al. (2015) Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of ACL-IJCNLP, pages 626–634.
- Suchanek et al. (2007) Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: a core of semantic knowledge. In Proceedings of WWW, pages 697–706.
- Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In Proceedings of ICLR.
- Turney (2005) Peter D Turney. 2005. Measuring semantic similarity by latent relational analysis. In Proceedings of IJCAI, pages 1136–1141.
- Turney (2006) Peter D Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32:379–416.
- Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57:78–85.
- Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of AAAI, pages 1112–1119.
- Weiss and Kulikowski (1991) Sholom M Weiss and Casimir A Kulikowski. 1991. Computer systems that learn: classification and prediction methods from statistics, neural nets, machine learning, and expert systems.
- Xie et al. (2016) Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2016. Representation learning of knowledge graphs with hierarchical types. In Proceedings of IJCAI, pages 2965–2971.
- Xu et al. (2015) Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of EMNLP, pages 1785–1794.
- Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of ICLR.
- Yao et al. (2011) Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Structured relation discovery using generative models. In Proceedings of EMNLP, pages 1456–1466.
- Yates et al. (2007) Alexander Yates, Michael Cafarella, Michele Banko, Oren Etzioni, Matthew Broadhead, and Stephen Soderland. 2007. TextRunner: open information extraction on the web. In Proceedings of NAACL, pages 25–26.
- Zeng et al. (2015) Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP, pages 1753–1762.
- Zeng et al. (2014) Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING, pages 2335–2344.
- Zhang et al. (2017) Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of EMNLP, pages 35–45.
- Zhila et al. (2013) Alisa Zhila, Wen-tau Yih, Christopher Meek, Geoffrey Zweig, and Tomas Mikolov. 2013. Combining heterogeneous models for measuring relational similarity. In Proceedings of NAACL, pages 1000–1009.
- Zhu et al. (2009) Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. 2009. StatSnowball: a statistical approach to extracting entity relationships. In Proceedings of WWW, pages 101–110.
Appendix A Proofs of theorems in the paper
If we have a proposal distribution satisfying , then equation 16 can be further written as
Sometimes it is hard to compute the normalized probability . To tackle this problem, we consider self-normalized importance sampling, which gives a consistent (though, for finite samples, slightly biased) estimate Owen (2013),
where is the normalized version of . ∎
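As an illustration of the self-normalized estimator referred to above, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `snis_estimate`, its arguments, and the toy target distribution are all assumptions made for this example. The point it shows is that only an unnormalized target density is needed, because the unknown normalizing constant cancels when the importance weights are divided by their sum; the estimate converges to the true expectation as the number of samples grows.

```python
import math
import random

def snis_estimate(f, log_p_unnorm, sample_q, log_q, n_samples, rng):
    """Self-normalized importance sampling estimate of E_p[f(x)].

    log_p_unnorm: log of the target density up to an unknown constant.
    sample_q / log_q: sampler and log-density of the proposal distribution.
    (All names here are hypothetical, chosen for this sketch.)
    """
    xs = [sample_q(rng) for _ in range(n_samples)]
    log_w = [log_p_unnorm(x) - log_q(x) for x in xs]
    # Subtract the maximum log-weight before exponentiating for stability.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    # Dividing by the weight sum cancels the unknown normalizing constant.
    return sum(wi * f(x) for wi, x in zip(w, xs)) / total

# Toy target over {0, 1, 2, 3} with unnormalized weights 1, 2, 3, 4,
# so the true mean is (0*1 + 1*2 + 2*3 + 3*4) / 10 = 2.0.
weights = [1.0, 2.0, 3.0, 4.0]
estimate = snis_estimate(
    f=lambda x: float(x),
    log_p_unnorm=lambda x: math.log(weights[x]),
    sample_q=lambda r: r.randrange(4),   # uniform proposal
    log_q=lambda x: math.log(0.25),
    n_samples=20000,
    rng=random.Random(0),
)
```

The proposal must put non-zero mass wherever the target does; the uniform proposal in the toy example satisfies this trivially.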
Appendix B Chinese Restaurant Process
Specifically, for a relation that currently has sub-relations, we assign it to a new sub-relation with probability
or to the existing sub-relation with probability
where is the size of the existing sub-relation, is the total size of all sub-relations of , and is a hyperparameter, for which we use .
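The assignment rule can be sketched with the textbook form of the Chinese restaurant process, in which a new group is opened with probability proportional to the concentration hyperparameter and an existing group is chosen with probability proportional to its size. This is an assumption about the exact form used here: the function names and the parameter name `alpha` are hypothetical, and no value of the hyperparameter is implied.

```python
import random

def crp_probabilities(sizes, alpha):
    """Probabilities of joining each existing sub-relation, then a new one.

    sizes[i] is the current size of existing sub-relation i;
    alpha is the concentration hyperparameter (name assumed for this sketch).
    """
    total = sum(sizes) + alpha
    return [n / total for n in sizes] + [alpha / total]

def crp_assign(sizes, alpha, rng):
    """Sample an assignment; the return value len(sizes) means 'open a new sub-relation'."""
    u = rng.random() * (sum(sizes) + alpha)
    acc = 0.0
    for i, n_i in enumerate(sizes):
        acc += n_i
        if u < acc:
            return i
    return len(sizes)

# With sizes [3, 1] and alpha = 1, the probabilities are 3/5, 1/5 and 1/5.
probs = crp_probabilities([3, 1], alpha=1.0)
choice = crp_assign([3, 1], alpha=1.0, rng=random.Random(0))
```

Larger sub-relations attract new members more strongly, while a larger `alpha` makes opening a new sub-relation more likely.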
Appendix C Training Details
For the Wikidata and ReVerb Extractions datasets, we manually split off a validation set, ensuring that every entity and relation that appears in the validation set also appears in the training set. While minimizing the loss on the training set, we monitor the loss on the validation set and stop training once the validation loss stops decreasing. Before training our model on any dataset, we initialize it with the entity and relation embeddings produced by TransE on that dataset.
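The coverage constraint on the validation set can be checked or enforced programmatically. The sketch below is an automated stand-in for the manual split described above, not the procedure actually used: the function `split_validation` and the toy triples are hypothetical. A triple is moved to validation only if each of its entities and its relation still occurs in at least one remaining training triple.

```python
import random
from collections import Counter

def split_validation(triples, n_valid, rng):
    """Move up to n_valid triples into a validation set while keeping every
    entity and relation of the validation set present in the training set."""
    ent_count, rel_count = Counter(), Counter()
    for h, r, t in triples:
        ent_count[h] += 1
        ent_count[t] += 1
        rel_count[r] += 1
    order = list(triples)
    rng.shuffle(order)
    train, valid = [], []
    for h, r, t in order:
        # A self-loop (h == t) contributes two occurrences of the same entity.
        ent_need = 2 if h == t else 1
        movable = (
            len(valid) < n_valid
            and ent_count[h] > ent_need
            and ent_count[t] > ent_need
            and rel_count[r] > 1
        )
        if movable:
            valid.append((h, r, t))
            ent_count[h] -= 1
            ent_count[t] -= 1
            rel_count[r] -= 1
        else:
            train.append((h, r, t))
    return train, valid

# Toy knowledge base, purely illustrative.
triples = [
    ("a", "r1", "b"), ("a", "r1", "c"), ("b", "r2", "c"),
    ("c", "r2", "a"), ("d", "r1", "a"),
]
train, valid = split_validation(triples, n_valid=2, rng=random.Random(0))
```

Because the counts only track triples that remain in training, no entity or relation can disappear from the training set, whatever the shuffle order.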
C.1 Training Details on Negative Sampling
Sampling starts with an initial temperature of 8192. The temperature is halved every 200 epochs and is held fixed once it reaches 16. Optimization is performed using SGD, with a learning rate of 1e-3.
C.2 Training Details on Softmax-Margin Loss
Sampling starts with an initial temperature of 64. The temperature drops by 20% per epoch and is held fixed once it reaches 16. The alpha we use is 9. Optimization is performed using SGD, with a learning rate of 1.
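The two temperature schedules in C.1 and C.2 can be written as small functions. This is a sketch under stated assumptions: epochs are counted from 0, each drop takes effect at the start of the epoch, and the floor of 16 is applied by clamping. The function names are hypothetical.

```python
def halving_temperature(epoch, initial=8192.0, period=200, floor=16.0):
    """Negative-sampling schedule: halve every `period` epochs, never below `floor`."""
    return max(floor, initial / (2 ** (epoch // period)))

def decaying_temperature(epoch, initial=64.0, drop=0.2, floor=16.0):
    """Softmax-margin schedule: lose a fraction `drop` per epoch, never below `floor`."""
    return max(floor, initial * (1.0 - drop) ** epoch)

# First few values of each schedule.
halving = [halving_temperature(e) for e in (0, 199, 200, 400, 1800, 5000)]
decaying = [decaying_temperature(e) for e in range(9)]
```

Under these assumptions the first schedule reaches its floor after 1800 epochs (nine halvings from 8192 to 16), and the second is clamped from the seventh epoch onward, since 64 × 0.8^7 is already below 16.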
Appendix D Recall Standard Deviation
As shown in Figure 6, the maximum standard deviation of recall is 0.4 for our model and 0.11 for TransE.
Appendix E Negative Sampling with Relation Type Constraints
In FB15K, if two relations share the same prefix, we regard them as belonging to the same type. For example, /film/film/starring./film/performance/actor and /film/actor/film./film/performance/film both have the prefix film, so they belong to the same type. Similar to what is mentioned in § 8, we expect the model to first learn to distinguish among obviously different relations, and then gradually learn to distinguish similar relations. Therefore, we conduct negative sampling with relation type constraints in two ways.
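The prefix-based typing can be sketched as follows. This is a minimal illustration, assuming the type is the first path segment of a Freebase-style relation name; the helper names `relation_type` and `group_by_type` are hypothetical, and the third relation in the toy list is included only to show a second type.

```python
from collections import defaultdict

def relation_type(relation):
    """Return the first path segment of a Freebase-style relation name."""
    return relation.strip("/").split("/")[0]

def group_by_type(relations):
    """Map each type to the list of relations that share that prefix."""
    groups = defaultdict(list)
    for rel in relations:
        groups[relation_type(rel)].append(rel)
    return dict(groups)

relations = [
    "/film/film/starring./film/performance/actor",
    "/film/actor/film./film/performance/film",
    "/people/person/nationality",  # illustrative extra relation
]
groups = group_by_type(relations)
```

Both film relations from the text end up in the same group, which is the set a type-constrained sampler would draw from.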
E.1 Add Up Two Uniform Distributions
For each triple , we have two uniform distributions and . is the uniform distribution over all the relations except for those that appear with in the knowledge base, and is the uniform distribution over the relations of the same type as . When corrupting the triple, we sample from the distribution:
where is a hyperparameter. We set to 1 at the beginning of training, and every epochs, is multiplied by the decrease rate . We ran a grid search over and , but observed no improvement.
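One way to read this scheme is as a two-component mixture whose weight shifts from the all-relations distribution toward the same-type distribution as training proceeds. The sketch below follows that reading and is an assumption, not the paper's exact formula: it takes the mixing weight to multiply the first (all-relations) component, excludes relations already observed with the entity pair from both components, and falls back to the first component when no same-type candidate exists. All names are hypothetical.

```python
import random

def mixture_weight(epoch, every, decay):
    """Weight of the all-relations component: starts at 1, multiplied by
    `decay` once every `every` epochs (both are grid-searched hyperparameters)."""
    return decay ** (epoch // every)

def sample_negative_relation(h, r, t, relations, known, type_of, alpha, rng):
    """Draw a corrupting relation for (h, r, t) from an assumed mixture:
    weight alpha on 'any unobserved relation', 1 - alpha on 'same type as r'.
    Assumes at least one unobserved relation exists for the pair (h, t)."""
    candidates = [x for x in relations if (h, x, t) not in known]
    same_type = [x for x in candidates if type_of[x] == type_of[r]]
    if same_type and rng.random() >= alpha:
        return rng.choice(same_type)
    return rng.choice(candidates)

# Toy setup, purely illustrative.
relations = ["/film/a", "/film/b", "/music/c"]
type_of = {"/film/a": "film", "/film/b": "film", "/music/c": "music"}
known = {("h", "/film/a", "t")}
rng = random.Random(0)
hard = [sample_negative_relation("h", "/film/a", "t", relations, known, type_of, 0.0, rng)
        for _ in range(20)]
easy = [sample_negative_relation("h", "/film/a", "t", relations, known, type_of, 1.0, rng)
        for _ in range(20)]
```

With a weight of 1 the sampler ignores types entirely; as the weight decays, more negatives come from the same type and are therefore harder to distinguish from the true relation.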
e.2 Add Weight
We speculate that adding up two uniform distributions gives unsatisfactory results because, for types containing few relations, a small change in will result in a significant change in . Therefore, when sampling a negative , we instead add weights to relations that are of the same type as . Concretely, we substitute with with probability , which can be calculated as:
where denotes all the relations that are of the same type as , is a hyperparameter, and is a normalizing constant. We set to 0 at the beginning of training, and every epochs, is increased by . We ran a grid search over and , but still observed no improvement.
Appendix F Wikidata annotation guidance
We show the guidance provided for the annotators here.
A pair of relations should be given 4 points if the two relations are merely two different expressions of the same meaning.
Example: (study at, be educated at)
A pair of relations should be given 3 points if the two relations describe the same topic and the entities they connect are of the same types, respectively.
Example: (be the director of, be the screenwriter of). Both relations relate to movies, and both connect entities of types (person, movie).
A pair of relations should be given 2 points if the two relations describe the same topic but the entities they connect are of different types.
Example: (be headquartered in, be founded in). Both relations relate to organizations, but the types of the entities they connect differ, i.e., (company, location) and (company, time).
A pair of relations should be given 1 point if the two relations do not meet any of the conditions above but are still semantically related.
Example: (be the developer of, be the employer of)
A pair of relations should be given 0 points if the two relations have no connection at all.
Example: (be a railway station located in, be published in)