Relation extraction (RE) extracts semantic relations between entities from plain text. For instance, “Jon Robin Baitz, born in Los Angeles …” expresses the relation /people/person/place_of_birth between the head and tail entities. Extracted relations are then used in downstream tasks such as information retrieval Corcoglioniti et al. (2016) and knowledge base construction Al-Zaidy and Giles (2018). RE has been widely studied using fully supervised learning Nguyen and Grishman (2015); Miwa and Bansal (2016); Zhang et al. (2017, 2018) and distantly supervised approaches Mintz et al. (2009); Riedel et al. (2010); Lin et al. (2016).
Unsupervised relation extraction (URE) methods have not been explored as much as fully or distantly supervised learning techniques. URE is promising, since it requires neither manually annotated data nor human-curated knowledge bases (KBs), both of which are expensive to produce. Therefore, it can be applied to domains and languages where annotated data and KBs are not available. Moreover, URE can discover new relation types, since it is not restricted to specific relation types in the same way as fully and distantly supervised methods. One might argue that Open Information Extraction (OpenIE) can also discover new relations. However, OpenIE identifies relations based on textual surface information. Thus, similar relations with different textual forms may not be recognised. Unlike OpenIE, URE groups similar relations into clusters. Despite these advantages, there have been only a few attempts to tackle URE using machine learning (ML) Hasegawa et al. (2004); Banko et al. (2007); Yao et al. (2011); Marcheggiani and Titov (2016); Simon et al. (2019).
Similarly to other unsupervised learning tasks, a challenge in URE is how to evaluate results. Recent approaches Yao et al. (2011); Marcheggiani and Titov (2016); Simon et al. (2019) employ a data generation setting that is widely used in distantly supervised RE, i.e., aligning a large amount of raw text against triplets in a curated KB. A standard metric score is computed by comparing the output relation clusters against the automatically annotated relations. In particular, the NYT-FB dataset Marcheggiani and Titov (2016), which is used for evaluation, has been created by mapping relation triplets in Freebase Bollacker et al. (2008) against plain text articles in the New York Times (NYT) corpus Sandhaus (2008). Standard clustering evaluation metrics for URE include $B^3$ Bagga and Baldwin (1998), V-measure Rosenberg and Hirschberg (2007), and ARI Hubert and Arabie (1985).
Although the above-mentioned experimental setting can be created automatically, there are three challenges to overcome. Firstly, the development and test sets are silver, i.e., they include noisily labelled instances, since they are not human-curated. Secondly, the development and test sentences are part of the training set, i.e., a transductive setting. It is thus unclear how well the existing models perform on unseen sentences. Finally, NYT-FB can be considered highly imbalanced, since only 2.1% of the training sentences can be aligned with Freebase’s triplets. Due to the noisy nature of silver data (NYT-FB), evaluation on silver data will not accurately reflect system performance. We also need unseen data during testing to examine how well a system generalises. To overcome these challenges, we will employ the test set of TACRED Zhang et al. (2017), a widely used manually annotated corpus. Regarding the imbalanced data, we will demonstrate that in fact around 60% (instead of 2.1%) of instances in the training set express relation types defined in Freebase.
In this work, we present a simple URE approach relying only on entity types that can obtain improved performance compared to current methods. Specifically, given a sentence consisting of two entities and their corresponding entity types, e.g., PERSON and LOCATION, we induce relations as the combination of entity types, e.g., PERSON-LOCATION. It should be noted that we employ only entity types because their combinations form reasonably coarse relation types (e.g., PERSON-LOCATION covers /people/person/place_of_birth defined in Freebase). We further discuss our improved performance in section 3.
Our contributions are as follows: (i) We perform experiments on both automatically and manually labelled datasets, namely NYT-FB and TACRED, respectively. We show that two methods using only entity types can outperform the state-of-the-art models, including both feature-engineering and deep learning approaches. These surprising results raise questions about the current state of unsupervised relation extraction. (ii) For model design, we show that a link predictor provides a good signal to train a URE model (fig. 1). We also illustrate that entity types are a strong inductive bias for URE (table 1).
2 Methods for URE
The goal of URE is to predict the relation $r$ between two entities $e_{\mathrm{head}}$ and $e_{\mathrm{tail}}$ in a sentence $s$. We will describe three recent ML-based methods tackling URE and our own methods. We divide the ML-based methods into two main approaches: generative and discriminative.
2.1 Generative Approach
Yao et al. (2011) extended topic modelling, specifically Latent Dirichlet Allocation (LDA) Blei et al. (2003), to RE, developing two models, herewith RelLDA and RelLDA1. In both models, a sentence together with its entity pair plays the role of a document in topic modelling, while a relation type corresponds to a topic. RelLDA uses three features, i.e., the shortest dependency path between the two entities and the two entity mentions. RelLDA1 extends RelLDA with five more features, i.e., the entity types, words and part-of-speech tags between the two entities.
2.2 Discriminative Approaches
Marcheggiani and Titov (2016) proposed a discrete-state variational autoencoder (VAE) to tackle URE (herewith March). Their model consists of two components: a relation classifier and a link predictor. The relation classifier, which is discriminative, takes entity types and several linguistic features (e.g., dependencies) as input to predict the relation $r$. The link predictor then uses the (soft) predicted relation $r$ to predict the missing entity $e_i$ in a specific position $i$, given the other entity $e_{-i}$, where if $i = \mathrm{head}$ then $-i = \mathrm{tail}$, and vice versa. In other words, entity prediction, in a self-supervised manner, provides training signals to learn the relation classifier. However, when only entity prediction is used, only a few relation types are chosen. They thus used entropy over all relations as a regulariser. Maximising the entropy regulariser encourages a uniform relation distribution and allows more relations to be predicted.
Another discriminative method is that of Simon et al. (2019) (herewith Simon), which differs from March in two ways: a) its relation classifier employs a piece-wise convolutional neural network (PCNN) that uses only surface forms, without requiring hand-crafted features; b) it replaces the entropy term with two regularisers: $\mathcal{L}_S$ (skewness), to encourage the relation classifier to be confident in its predictions, and $\mathcal{L}_D$ (dispersion), to ensure that several relation types are predicted over a minibatch. Note that $\mathcal{L}_D$ is equivalent to the negation of the entropy used in March.
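The roles of the two regularisers can be illustrated with a small sketch. This is not the implementation of Simon et al. (2019); it assumes that skewness is the mean per-instance entropy of the predicted relation distribution and that dispersion is the KL divergence between the batch-averaged distribution and the uniform distribution. The function names and the toy minibatches are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def skewness_loss(batch):
    """Assumed skewness term: mean per-instance entropy.
    It is low when every individual prediction is confident."""
    return sum(entropy(p) for p in batch) / len(batch)

def dispersion_loss(batch):
    """Assumed dispersion term: KL(batch mean || uniform).
    It is low when the minibatch as a whole uses many relation types."""
    k = len(batch[0])
    mean = [sum(p[j] for p in batch) / len(batch) for j in range(k)]
    # KL(mean || uniform) = log(k) - H(mean)
    return math.log(k) - entropy(mean)

# Confident predictions spread over three relations: both terms are close to 0.
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Confident predictions that all pick the same relation: dispersion is high.
collapsed = [[1.0, 0.0, 0.0]] * 3
# Uniform predictions: dispersion is close to 0 but skewness is maximal.
unsure = [[1 / 3, 1 / 3, 1 / 3]] * 3

for name, batch in [("spread", spread), ("collapsed", collapsed), ("unsure", unsure)]:
    print(name, round(skewness_loss(batch), 4), round(dispersion_loss(batch), 4))
```

Under these assumed definitions, the dispersion term equals $\log K$ minus the entropy of the batch-averaged distribution, so minimising it amounts to maximising that entropy.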
2.3 Our Methods
We introduce two entity-based methods, herewith EType and EType+. Our motivation is that entity types are helpful for RE, as noted by Zhang et al. (2017) for supervised learning and Ren et al. (2017) for distantly supervised learning. In URE, Yao et al. (2011) and Marcheggiani and Titov (2016) also used entity types. We therefore propose EType, which induces coarse relation clusters from the entity types. In particular, given two entity types $t_{\mathrm{head}}$ and $t_{\mathrm{tail}}$ as input, EType outputs their concatenation $t_{\mathrm{head}}$-$t_{\mathrm{tail}}$ as the relation.
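As a minimal sketch, EType amounts to the following. The type inventory below is a hypothetical example of four coarse types; the actual tag set depends on the named entity tagger used.

```python
from itertools import product

def etype_relation(head_type: str, tail_type: str) -> str:
    """Induce a coarse relation label by concatenating the two entity types."""
    return f"{head_type}-{tail_type}"

# Hypothetical inventory of four coarse entity types (an assumption for illustration).
ENTITY_TYPES = ["PERSON", "LOCATION", "ORGANIZATION", "MISC"]

# Every ordered (head, tail) type pair yields one relation cluster.
all_relations = [etype_relation(h, t) for h, t in product(ENTITY_TYPES, repeat=2)]

print(etype_relation("PERSON", "LOCATION"))  # PERSON-LOCATION
print(len(all_relations))                    # 16
```

The order of the pair matters here: PERSON-LOCATION and LOCATION-PERSON are distinct clusters.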
One problem with EType is that the number of relation types is determined by the number of entity types. For instance, 4 entity types lead to $4 \times 4 = 16$ relation types. To extract an arbitrary number of relation types, we build a relation classifier that consists of a one-layer feed-forward network whose input is the one-hot vector of the entity type pair. We then employ the link predictor used in March and the two regularisers used in Simon, to produce a new method, herewith EType+.
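The relation classifier of EType+ can be sketched as below. This is an illustration under assumptions rather than the authors' code: the softmax output, the bias term, the random initialisation, and the four-type inventory are all hypothetical choices, and training with the link predictor and the regularisers is omitted.

```python
import math
import random

# Hypothetical inventory of coarse entity types (an assumption for illustration).
ENTITY_TYPES = ["PERSON", "LOCATION", "ORGANIZATION", "MISC"]
PAIRS = [(h, t) for h in ENTITY_TYPES for t in ENTITY_TYPES]
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}

def one_hot(head_type, tail_type):
    """One-hot vector identifying the (head type, tail type) pair."""
    vec = [0.0] * len(PAIRS)
    vec[PAIR_INDEX[(head_type, tail_type)]] = 1.0
    return vec

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class TypePairClassifier:
    """One linear layer followed by a softmax (softmax and bias are assumptions)."""

    def __init__(self, num_relations, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.gauss(0.0, 0.1) for _ in PAIRS]
                        for _ in range(num_relations)]
        self.bias = [0.0] * num_relations

    def predict_proba(self, head_type, tail_type):
        x = one_hot(head_type, tail_type)
        scores = [sum(w * xi for w, xi in zip(row, x)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(scores)

# The number of relation types is now a free choice, independent of the 16 type pairs.
clf = TypePairClassifier(num_relations=10)
probs = clf.predict_proba("PERSON", "LOCATION")
print(len(probs), round(sum(probs), 6))
```

Because the input is one-hot, the layer effectively learns one score vector per type pair, so several type pairs can be merged into one relation or mapped to different ones.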
3 Experiments and Results
We use the following evaluation metrics for our analysis: a) $B^3$ Bagga and Baldwin (1998), b) V-measure Rosenberg and Hirschberg (2007), and c) ARI Hubert and Arabie (1985), as used in Simon et al. (2019). (We used the sklearn.metrics package to compute V-measure and ARI.) V-measure is analysed in terms of homogeneity and completeness, while ARI measures the similarity between two clusterings. We note that V-measure is sensitive to the relationship between the number of clusters and the number of instances. A relatively small number of clusters compared to the number of instances should be used to keep V-measure scores comparable. More precisely, we evaluated the V-measure of the trivial homogeneity, where there are only singleton clusters (i.e., each instance is its own cluster). The V-measure of the trivial homogeneity on NYT-FB reached 43.77%, which is higher than all the implemented methods in table 1. Meanwhile, neither $B^3$ nor ARI encounters this problem.
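To make the metric definitions concrete, the following is a small pure-Python sketch of $B^3$ and V-measure. It is meant only as an illustration of the standard definitions, not as the evaluation code used in the experiments (which relied on sklearn.metrics for V-measure and ARI); the toy labels are hypothetical.

```python
import math
from collections import Counter

def b_cubed(gold, pred):
    """Return (precision, recall, F1) of B-cubed for two label sequences."""
    n = len(gold)
    gold_sizes, pred_sizes = Counter(gold), Counter(pred)
    joint = Counter(zip(gold, pred))
    precision = sum(joint[(g, p)] / pred_sizes[p] for g, p in zip(gold, pred)) / n
    recall = sum(joint[(g, p)] / gold_sizes[g] for g, p in zip(gold, pred)) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def _entropy(counts, n):
    """Entropy (natural log) of a distribution given by raw counts."""
    return -sum(c / n * math.log(c / n) for c in counts)

def v_measure(gold, pred):
    """Return (homogeneity, completeness, V-measure)."""
    n = len(gold)
    gold_sizes, pred_sizes = Counter(gold), Counter(pred)
    joint = Counter(zip(gold, pred))
    h_gold = _entropy(gold_sizes.values(), n)
    h_pred = _entropy(pred_sizes.values(), n)
    # Conditional entropies H(gold | pred) and H(pred | gold).
    h_gold_given_pred = -sum(c / n * math.log(c / pred_sizes[p])
                             for (g, p), c in joint.items())
    h_pred_given_gold = -sum(c / n * math.log(c / gold_sizes[g])
                             for (g, p), c in joint.items())
    homogeneity = 1.0 if h_gold == 0 else 1.0 - h_gold_given_pred / h_gold
    completeness = 1.0 if h_pred == 0 else 1.0 - h_pred_given_gold / h_pred
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

# Toy gold labels and the trivial clustering in which every instance is its own cluster.
gold = ["born_in"] * 6 + ["works_for"] * 2
singletons = list(range(len(gold)))

print("B3 (P, R, F1):", b_cubed(gold, singletons))
print("V  (h, c, V): ", v_measure(gold, singletons))
```

In the singleton clustering every cluster is pure, so homogeneity is 1 and V-measure is driven entirely by completeness; the toy data are too small to reproduce the scale of the effect reported on NYT-FB.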
We employed NYT-FB for training and evaluation following previous work Yao et al. (2011); Marcheggiani and Titov (2016); Simon et al. (2019). Because only 2.1% of the sentences in NYT-FB were aligned against Freebase’s triplets, we were concerned whether this dataset contains enough sentences for a model to learn relation types from Freebase. We thus examined randomly chosen instances from the 1.86M non-aligned sentences. We found that 61% of them (or 60% of the whole dataset) express relation types in Freebase. This suggests that the NYT-FB dataset can be employed to train a relation extractor. However, there are two further issues when evaluating URE methods on NYT-FB. Firstly, the development and test sets consist entirely of aligned sentences without human curation, which means that they include wrongly or noisily labelled instances. In particular, we found that 35 out of 100 randomly chosen sentences were given incorrect relations. Secondly, the development and test sets are part of the training set. This setting is obviously not inductive, as it does not evaluate how a model performs on unseen sentences. Therefore, we additionally evaluate all methods (except topic modelling) on the test set of TACRED Zhang et al. (2017), a widely used manually annotated corpus for supervised RE. The statistics of both NYT-FB and TACRED are provided in appendix A.
We examine three models, RelLDA1, March, and Simon, using the reported hyper-parameters Yao et al. (2011); Marcheggiani and Titov (2016); Simon et al. (2019). For comparison, we also evaluate March with the two regularisers of Simon (skewness $\mathcal{L}_S$ and dispersion $\mathcal{L}_D$), namely March ($\mathcal{L}_S + \mathcal{L}_D$). To evaluate on TACRED, we employed the original March using the published repository (github.com/diegma/relation-autoencoder). Meanwhile, for March ($\mathcal{L}_S + \mathcal{L}_D$) and Simon, we reimplemented these models and evaluated them on TACRED. Regarding our methods, EType does not have hyper-parameters, while EType+ uses the same optimiser and entity type dimension as in Simon. All the hyper-parameters used in our experiments are listed in appendix B.
Table 1 reports the average performance of our methods across three runs in comparison with the three ML models on NYT-FB and TACRED. Our models outperform the best-performing system of Simon et al. (2019) on both datasets, except for ARI on NYT-FB. ARI has been shown to be suitable when there are large equal-sized clusters Romano et al. (2016), whereas relation datasets are generally imbalanced (both NYT-FB and TACRED in this study; please refer to appendix A for the detailed statistics). For this reason, ARI might not be appropriate for evaluating URE systems. In addition, the ML methods consistently exhibit lower performance on TACRED than on NYT-FB. The full results are shown in appendix C.
The results of our evaluation demonstrate that our models outperform previous methods, despite being simpler than them. These results lead us to the following findings.
Do ML models employ proper inductive biases?
In common with other unsupervised learning approaches, there is no guarantee that a URE model would learn the relation types in the KBs and/or annotated data used. A common solution is to employ inductive biases Wagstaff (2000) to guide the learning process towards the desired relation types. Inductive biases can emanate from pre-processed data. Since our models outperform other methods, we conclude that entity type information alone constitutes a better bias than the biases employed by existing ML models. Indeed, entity types constitute a useful bias for this task. Among the topic modelling based methods, RelLDA1 outperforms RelLDA, which does not employ entity types. In a separate experiment, we found that adding entity types to the Simon model helped to achieve higher performance than the original version, i.e., 42.74% vs. 39.4% $B^3$ F1 on the NYT-FB test set. However, although both RelLDA1 and March also employ entity types, their performance is still lower than ours. A possible reason is that the other syntactic and word features used in these two models cancel out the useful bias of entity types. (More details are in the last paragraph of this section.)
Inductive biases can emanate from training signals. March and Simon are trained from a link predictor, which provides indirect signals to train a relation classifier. Hence, the question here is “can the link predictor induce good training signals?” To answer this, we examine the link predictor with alternative settings:
- Rand10 randomly assigns one of 10 relation types to each entity pair;
- Rand10 with silver frequencies, similar to Rand10, randomly generates relation types but follows the silver relation distribution;
- One relation assumes that all entity pairs share the same relation type;
- EType uses 16 relation types induced from 4 coarse entity types;
- Silver relations (10) takes the 9 most frequent relation types and groups the rest together to form the tenth relation type;
- Silver relations (full) considers the full set of (silver) annotated relations, i.e., 262 types.
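The settings listed above can be sketched as simple label generators that replace the relation classifier when training the link predictor. This is a hypothetical illustration: the function names and toy data are ours, and the exact sampling procedure for "Rand10 with silver frequencies" is an assumption (here, sampling from the silver distribution after grouping into 10 types).

```python
import random
from collections import Counter

def rand_k(num_pairs, k=10, seed=0):
    """Rand10: one of k relation ids, chosen uniformly at random per entity pair."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(num_pairs)]

def group_top_k(silver, k=10, other="OTHER"):
    """Silver relations (10): keep the k-1 most frequent labels, merge the rest."""
    keep = {label for label, _ in Counter(silver).most_common(k - 1)}
    return [label if label in keep else other for label in silver]

def rand_k_silver_freq(silver, k=10, seed=0):
    """Rand10 with silver frequencies: random labels that follow the silver
    label distribution (assumed to be the distribution after grouping)."""
    rng = random.Random(seed)
    counts = Counter(group_top_k(silver, k))
    labels, weights = zip(*counts.items())
    return rng.choices(labels, weights=weights, k=len(silver))

def one_relation(num_pairs):
    """One relation: every entity pair shares the same relation id."""
    return [0] * num_pairs

def etype(type_pairs):
    """EType: the relation is the concatenation of the two entity types."""
    return [f"{h}-{t}" for h, t in type_pairs]

# Toy silver labels: 12 relation types with frequencies 12, 11, ..., 1.
silver = [f"rel{i}" for i in range(12) for _ in range(12 - i)]
grouped = group_top_k(silver)
print(len(set(silver)), "->", len(set(grouped)), "relation types")
```

Silver relations (full) corresponds to using the `silver` list unchanged.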
Fig. 1 illustrates the average loss values obtained with these settings. If high-quality relations were critical for training the link predictor, we would expect lower losses when using annotated relations. Indeed, the loss curve of using 10 correct relation types is consistently below all the others. This implies that the link predictor is able to provide reasonable signals for training a relation classifier. So why are the Simon and March models outperformed by our models? As pointed out by Simon et al. (2019), the link predictor itself cannot be trained without a good relation classifier. This suggests that the relation classifiers in both methods need to be improved. Empirically, both the Simon and March models are outperformed (in $B^3$ and V-measure) by our EType+, which uses the same link predictor. We also notice that One relation and EType end up with similar performance. This might imply that we only need one relation (matrix) to predict head/tail entities, as the link predictor is very expressive. However, the silver relations are clearly helpful, as during the first 15 epochs their losses are much lower than the others.
Why was the performance on TACRED lower?
Although TACRED shares similar relation types with Freebase, we observed that both the March and Simon models consistently perform worse on the TACRED dataset. More precisely, the Simon model yields significantly worse performance on TACRED, with a $B^3$ score of 15.7%, less than half of its score on NYT-FB (39.4%). This performance drop might be attributed to the distributional shift between the two datasets: variation and semantic shift in vocabulary and language structure over time, since NYT was collected long before TACRED.
How is the performance when combining entity types with other features?
Our methods using only entity types surprisingly perform better than the previous state-of-the-art methods, including feature engineering and deep learning models. However, we know that context information is crucial for distinguishing the relation between two entities, and many RE studies have proposed integrating context information to improve RE performance. We therefore conduct experiments that combine entity types with common RE features (table 2). The list of features includes: (i) Entity: textual surface form of the two entities, (ii) BOW: bag of words between the two entities, (iii) DepPath: words on the dependency path between the two entities, (iv) POS: part-of-speech tag sequence between the two entities, and (v) Trigger: DepPath without stop words. In general, naively combining entity types with other features did not improve model performance. Additionally, the BOW feature had negative effects on RE performance. This indicates that the bag of words between two entities often includes uninformative and redundant words, i.e., noise, that are difficult to eliminate using simple neural architectures. While (i)-(v) are widely used hand-crafted features for RE, we also incorporated a neural context encoder, PCNN, which combines Simon’s PCNN encoder with the entity masking and position-aware attention proposed in Zhang et al. (2017). However, the performance obtained when adding PCNN is also lower than that of entity types alone.
We have shown the importance of entity types in URE. Our methods use only entity types, yet they yield higher performance than previous work on both NYT-FB and TACRED. We have investigated the current experimental setting, concluding that a strong inductive bias is required to train a relation extraction model without labelled data. URE remains challenging and requires improved methods to deal with silver data. We also plan to use different types of labelled data, e.g., domain-specific datasets, to ascertain whether entity type information is more discriminative in sub-languages.
We would like to thank the reviewers for their comments, Diego Marcheggiani for sharing his dataset with us, and Étienne Simon for sharing the hyperparameters. The first author thanks the University of Manchester for the Research Impact Scholarship Award. This work is also funded by Lloyd’s Register Foundation, Discovering Safety Programme, Thomas Ashton Institute.
- Al-Zaidy and Giles (2018). Extracting semantic relations for scholarly knowledge base construction. In 2018 IEEE 12th international conference on semantic computing (ICSC), pp. 56–63.
- Bagga and Baldwin (1998). Algorithms for scoring coreference chains. In Proceedings of the First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, Vol. 1, pp. 563–566.
- Banko et al. (2007). Open information extraction from the web. In IJCAI, Vol. 7, pp. 2670–2676.
- Blei et al. (2003). Latent dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022.
- Bollacker et al. (2008). Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250.
- Corcoglioniti et al. (2016). Knowledge extraction for information retrieval. In European Semantic Web Conference, pp. 317–333.
- Hasegawa et al. (2004). Discovering relations among named entities from large corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, pp. 415–422.
- Hubert and Arabie (1985). Comparing partitions. Journal of classification 2 (1), pp. 193–218.
- Lin et al. (2016). Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 2124–2133.
- Marcheggiani and Titov (2016). Discrete-state variational autoencoders for joint discovery and factorization of relations. Transactions of the Association for Computational Linguistics 4, pp. 231–244.
- Mintz et al. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 1003–1011.
- Miwa and Bansal (2016). End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1105–1116.
- Nguyen and Grishman (2015). Relation extraction: perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, Colorado, pp. 39–48.
- Riedel et al. (2010). Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 148–163.
- Romano et al. (2016). Adjusting for chance clustering comparison measures. Journal of Machine Learning Research 17 (1), pp. 4635–4666.
- Rosenberg and Hirschberg (2007). V-measure: a conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 410–420.
- Sandhaus (2008). The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia 6 (12), pp. e26752.
- Simon et al. (2019). Unsupervised information extraction: regularizing discriminative approaches with relation distribution losses. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1378–1387.
- Wagstaff (2000). Refining inductive bias in unsupervised learning via constraints. In AAAI/IAAI, pp. 1112.
- Yao et al. (2011). Structured relation discovery using generative models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK., pp. 1456–1466.
- Zhang et al. (2018). Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2205–2215.
- Zhang et al. (2017). Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 35–45.
Appendix A Datasets
Table 3 shows the statistics of the NYT-FB Marcheggiani and Titov (2016) and TACRED Zhang et al. (2017) datasets. We followed the same data split and pre-processing described in Marcheggiani and Titov (2016). All methods were trained on NYT-FB and evaluated on both NYT-FB and TACRED.
Fig. 2 illustrates the relation distributions of the two datasets, NYT-FB and TACRED. In NYT-FB, the 15 most frequent of the 253 relations account for 82.97% of all instances. Meanwhile, in TACRED, the 15 most frequent of the 41 relations account for 74.94% of all instances.
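The coverage figures above are cumulative shares of the most frequent relation labels. A minimal sketch of that computation is given below; the function name and the toy label list are hypothetical and do not reproduce the dataset statistics.

```python
from collections import Counter

def top_k_share(labels, k):
    """Fraction of instances covered by the k most frequent labels."""
    counts = Counter(labels)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(labels)

# Toy, imbalanced label list (not the real data).
toy = ["a"] * 70 + ["b"] * 20 + ["c"] * 6 + ["d"] * 4
print(top_k_share(toy, 2))  # 0.9
```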
Appendix B Hyper-parameter Settings
We used the development set for early stopping. For every model, we conducted three runs with differently initialised parameters and computed the average performance. We list the hyper-parameters of the different models in table 4.
Appendix C Detailed Results
Table 5 presents the average test scores of three runs on the NYT-FB and TACRED datasets. We note that the two models proposed by Marcheggiani and Titov (2016) and Simon et al. (2019) are sensitive to the hyper-parameters and thus difficult to train. We could not replicate the performance of Simon on the NYT-FB dataset.