Relation classification (RC) is an important task in NLP, aiming to determine the correct relation between two entities in a given sentence. Many works have been proposed for this task, including kernel methods Zelenko et al. (2002); Mooney and Bunescu (2006), embedding methods Gormley et al. (2015), and neural methods Zeng et al. (2014). The performance of these conventional models heavily depends on time-consuming and labor-intensive annotated data, which make themselves hard to generalize well. Adopting distant supervision is a primary approach to alleviate this problem for RC Mintz et al. ; Riedel et al. ; Hoffmann et al. (2011); Surdeanu et al. (2012); Zeng et al. (2015); Lin et al. (2016)
, which heuristically aligns knowledge bases (KBs) and text to automatically annotate adequate amounts of training instances. We evaluate the model proposed byLin et al. (2016), which is followed by the recent state-of-the-art methods Zeng et al. (2017); Ji et al. (2017); Huang and Wang (2017); Wu et al. (2017); Liu et al. (2017); Feng et al. (2018); Zeng et al. (2018), on the benchmark dataset NYT-10 Riedel et al. . Though it achieves promising results on common relations, the performance of a relation drops dramatically when its number of training instances decrease. About of the relations in NYT-10 are long-tail with fewer than
instances. Furthermore, distant supervision suffers from the wrong labeling problem, which makes it harder to classify long-tail relations. Hence, it is necessary to study training RC models with insufficient training instances.
|(A) capital_of||(1) London is the capital of the U.K.|
|(2) Washington is the capital of the U.S.A.|
|(B) member_of||(1) Newton served as the president of the Royal Society.|
|(2) Leibniz was a member of the Prussian Academy of Sciences.|
|(C) birth_name||(1) Samuel Langhorne Clemens, better known by his pen name Mark Twain, was an American writer.|
|(2) Alexei Maximovich Peshkov, primarily known as Maxim Gorky, was a Russian and Soviet writer.|
|(A) or (B) or (C)||Euler was elected a foreign member of the Royal Swedish Academy of Sciences.|
We formulate RC as a few-shot learning task in this paper, which requires models capable of handling classification task with a handful of training instances, as shown in Table 1. Many efforts have devoted to few-shot learning. The early works Caruana (1995); Bengio (2012); Donahue et al. (2014)
apply transfer learning methods to finetune pre-trained models from the common classes containing adequate instances to the uncommon classes with only few instances. Then metric learning methodsKoch et al. (2015); Vinyals et al. (2016); Snell et al. (2017)
have been proposed to learn the distance distributions among classes. Similar classes are adjacent in the distance space. The metric methods also take advantage of non-parametric estimation to make models efficient and general. Recently, the idea of meta-learning is proposed, which encourages the models to learn fast-learning abilities from previous experience and rapidly generalize to new concepts. Many meta-learning modelsRavi and Larochelle (2017); Santoro et al. (2016); Finn et al. (2017); Munkhdalai and Yu (2017) achieve the state-of-the-art results on several few-shot benchmarks.
Though meta-learning methods develop fast, most of these works evaluate on two popular datasets, Omniglot Lake et al. (2015)
and mini-ImageNetVinyals et al. (2016). Both the datasets concentrate on image classification. Many works in NLP mainly focus on the zero-shot/semi-supervised scenario Xie et al. (2016); Ma et al. (2016); Carlson et al. (2009), which incorporate extra information to classify objects never appearing in the training sets. However, the few-shot scenario needs models to classify objects with few instances without any extra information. Recently, Yu et al. (2018) propose a multi-metric method for few-shot text classification. However, there lack systematic researches about adopting few-shot learning for NLP tasks. We propose FewRel: a new large-scale supervised Few-shot Relation Classification dataset. To address the wrong labeling problem in most distantly supervised RC datasets, we apply crowd-sourcing to manually remove the noise.iiiMany previous works, such as (Roth et al., 2013; Luo et al., 2017; Xin et al., 2018) have worked on automatically removing noise from distantly supervision. Instead, we use crowd-sourcing methods to achieve a high accuracy.
Besides constructing the dataset, we systematically implement the most recent state-of-the-art few-shot learning methods and adapt them for RC. We conduct a detailed evaluation for all these models on our dataset. Though the state-of-the-art few-shot learning methods have much lower results than humans on our challenging dataset, they significantly outperform the vanilla RC models, indicating that incorporating few-shot learning is promising and needs further research. In summary, our contribution is three-fold:
(1) We formulate RC as a few-shot learning task, and propose a new large supervised few-shot RC dataset.
(2) We systematically adapt the most recent state-of-the-art few-shot learning methods for RC, which may further benefit other NLP tasks.
(3) We conduct a comprehensive evaluation of few-shot learning methods on our dataset, which indicates some promising research directions for RC.
2 FewRel Dataset
In this section, we describe the process of creating FewRel in detail. The whole procedure can be divided into two steps: (1) We create a large candidate set of sentences aligned to relations via distant supervision. (2) We ask human annotators to filter out the wrong labeled sentences for each relation to finally achieve a clean RC dataset.
2.1 Distant Supervision
For the first step, We use Wikipedia as the corpusiiiiiiWe use whole Wikipedia articles as corpus, not just the first sentence.
and Wikidata as the KB. Wikidata is a large-scale KB where many entities are already linked to Wikipedia articles. The articles in Wikipedia also contain anchors linking to each other. Thus it is convenient to align sentences in Wikipedia articles to KB facts in Wikidata. We also employ entity linking technique to extract more unanchored entities in articles. We first adopt named entity recognition via spaCyiiiiiiiiihttps://spacy.io/ to find possible entity mentions, then match each mention with the name of an entity in KBs, and link the mention to the entity if successfully matched.
For each sentence in Wikipedia articles containing head and tail entities and , if there exists a Wikidata statement meaning and have the relation , we denote the tuple as an instance and add it to the candidate set. Empirically, many instances of a given relation contain the same entity pair. For such relation, classifiers may prefer memorizing the entity pairs in the training instances rather than grasping the sentence semantics. Therefore, in the candidate set of each relation, we only keep instance for each unique entity pair. Finally, we remove relations with fewer than instances, and randomly keep instances for the rest of the relations. As a result, we get a candidate set of relations and instances.
2.2 Human Annotation
Next, we invite some well-educated annotators to filter the raw data on a platform similar to Amazon MTurk developed by ourselves. The platform presents each annotator with one instance each time, by showing the sentence, two entities in the sentence, and the corresponding relation labeled by distant supervision. The platform also provides the name of the entities and relation in Wikidata accompanied with the detailed description of that relation. Then the annotator is asked to judge whether the relation could be deduced only from the sentence semantics. We also ask the annotator to mark an instance as negative if the sentence is not complete, or the mention is falsely linked with the entity.
Relations are randomly assigned to annotators from the candidate set, and each annotator will consecutively annotate instances of the same relation before switching to next relation. To ensure the labeling quality, each instance is labeled by at least two annotators. If the two annotators have disagreements on this instance, it will be assigned to a third annotator. As a result, each instance has at least two same annotations, which will be the final decision. After the annotation, we remove relations with fewer than positive instances. For the remaining relations, we calculate the inter-annotator agreement for each relation using the free-marginal multirater kappa Randolph (2005), and keep the top relations.
2.3 Dataset Statistics
The final FewRel dataset consists of relations, each has instances. A full list of relations, including their names and descriptions, is provided in Appendix A.2. The average number of tokens in each sentence is , and there are unique tokens in total. Following recent meta-learning tasks Vinyals et al. (2016), which use separate sets of classes for training and testing, we use , , and relations for training, validation, and testing respectively. Table 2 provides a comparison of our FewRel dataset to two other popular few-shot classification datasets, Omniglot and mini-ImageNet. Table 3 provides a comparison of FewRel to the previous RC datasets, including SemEval-2010 Task 8 dataset (Hendrickx et al., 2009), ACE 2003-2004 dataset (Strassel et al., 2008), TACRED dataset (Zhang et al., 2017), and NYT-10 dataset (Riedel et al., 2010). While some RC datasets contain instances with no relations (negative), we ignore such instances for comparison.
|SemEval-2010 Task 8|
|Model||5 Way 1 Shot||5 Way 5 Shot||10 Way 1 Shot||10 Way 5 Shot|
|Meta Network (CNN)|
|Prototypical Network (CNN)|
We conduct comprehensive evaluations of vanilla RC models with simple strategies such as finetune or kNN on our new dataset. We also evaluate the recent state-of-the-art few-shot learning methods.
3.1 Task Formulation
In few-shot relation classification, we intend to obtain a function . Here defines the relations that the instances are classified into. is a support set
including instances for each relation . For relation classification, a data instance is a sentence accompanied with a pair of entities. The query data is an unlabeled instance to classify, and is the prediction of given by .
In recent research on few-shot learning, way shot setting is widely adopted. We follow this setting for the few-shot relation classification problem. To be exact, for way shot learning
3.2 Experiment Settings
We consider four types of few-shot tasks in our experiments: 5 way 1 shot, 5 way 5 shot, 10 way 1 shot, 10 way 5 shot. Under this setting, we evaluate different few-shot training strategies and state-of-the-art few-shot learning methods built upon two widely used instance encoders, CNN Zeng et al. (2014) and PCNN Zeng et al. (2015).
For both CNN and PCNN, the sentence is first represented to the input vectors by transforming each word into concatenation of word embeddings and position embeddings. In CNN, the input vectors pass a convolution layer, a max-pooling layer, and a non-linear activation layer to get the final output sentence embedding. PCNN is a variant of CNN, which replaces the max-pooling operation with a piecewise max-pooling operation.
To evaluate this two vanilla models in few-shot RC task, we first consider two training strategies, namely Finetune and kNN. For the Finetune baseline, it learns to classify all relations on the training set with CNN/PCNN, and tune parameters on the support set. We only tune the parameters of output layer, and keep other parameters unchanged. For the kNN baseline, it also jointly classifies all relations during training, while at the test time, it uses the neural networks to embed all the instances and then adopts k-nearest-neighbor (kNN) to classify the test instances.
By adapting them to relation classification, we also evaluate four recently proposed few-shot learning methods, including Meta Network Munkhdalai and Yu (2017), GNN Satorras and Estrach (2018), SNAIL Mishra et al. (2018), and Prototypical Network Snell et al. (2017). We describe briefly about these baselines in Sec. 3.3
. If you are familiar with these methods, you can safely skip that subsection. The hyperparameters of each model are selected via grid search against the validation set.
Human performance is also evaluated under 5 way 1 shot setting and 10 way 1 shot setting. A human labeler is given instances from different relations and one extra test instance. Human labelers are asked to decide which relation the test instance belongs to. Note that these labelers are not provided the name of the relations and any extra information. Since 5 way 5 shot and 10 way 5 shot settings are easier, we only evaluate performance of 5 way 1 shot and 10 way 1 shot.
3.3 Baselines of Few-shot Learning Models
Meta Network Munkhdalai and Yu (2017) is a meta learning algorithm utilizing a high level meta learner on top of the traditional classification model, or base learner, to supervise the training process. The weights of base learner are divided into two groups, fast weights and slow weights. Fast weights are generated by the meta learner, whereas slow weights are simply updated by minimizing classification loss. The fast weights are expected to help the model generalize to new tasks with very few training instances.
GNN Satorras and Estrach (2018)
tackles the few-shot learning problem by considering each supporting instance or query instance as a node in the graph. For those instances in the support sets, label information is also embedded into the corresponding node representations. Graph neural networks are then employed to propagate the information between nodes. A query instance is expected to receive information from support sets in order to make the classification. In our adaption, while the instances are encoded by CNNs, labels are represented by one-hot encoding.
SNAIL Mishra et al. (2018)
is a meta learning model that utilizes temporal convolutional neural networks and attention modules for fast learning from past experience. SNAIL arranges all the supporting instance-label pairs into a sequence and appends the query instance behind them. Such an order agrees with the temporal order of learning process where we learn information by reading supporting instances before making predictions for unlabeled instances. Temporal convolution (a 1-D convolution) is then performed along the sequence to aggregate information across different time steps and a causally masked attention model is used over the sequence to aggregate useful information from former instances to latter ones.
Prototypical Network Snell et al. (2017) is a few-shot classification model based on the assumption that for each class there exists a prototype. The model tries to find the prototypes for classes from supporting instances, and compares the distance between the query instance and each prototype under certain distance metric. Prototypical network learns a embedding function to embed each class’s instances, and computes each prototype by averaging over all the output embeddings of instances in the support set that are labeled with the corresponding class.
4 Result Analysis and Future Work
We report evaluation results in Table 4. From our preliminary experiments, PCNN with few-shot learning methods perform 3-10 percentages worse than CNN, therefore only CNN results are shown in our experimental results. From the results, we observe that integrating few-shot learning methods into CNN significantly outperforms CNN/PCNN with finetune or kNN, which means adapting few-shot learning methods for RC is promising. However, there are still huge gaps between their performance and humans’, which means our dataset is a challenging testbed for both relation classification and few-shot learning.
|Chris Bohjalian graduated from Amherst College Summa Cum Laude, where he was a member of the Phi Beta Kappa Society.||Simple Pattern|
|James Alty obtained a 1st class honours (Physics) at Liverpool University.||Common-sense Reasoning|
|He was a professor at Reed College, where he taught Steve Jobs, and replaced Lloyd J. Reynolds as the head of the calligraphy program.||Logical Reasoning|
|He and Cesare Borgia were thought to be close friends since childhood, going on to accompany one another during their studies at the University of Pisa.||Co-reference Reasoning|
In this paper, we propose a new large and high quality dataset, FewRel, for few-shot relation classification task. This dataset provides a new point of view for RC, and also a new benchmark for few-shot learning. Through the evaluation of different few-shot learning methods, we find even the best model performs much worse than humans, which suggests there is still large space for few-shot learning methods to improve.
The most challenging characteristic of our dataset is the diversity in expressing the same relation. We provide some examples from FewRel in Table 5, showing different reasoning modes needed for classifying some instances. Future researches may consider incorporating commonsense knowledge or improved causal modules.
This work is supported by the National Natural Science Foundation of China (NSFC No. 61572273, 61532010). This work is also funded by the Natural Science Foundation of China (NSFC) and the German Research Foundation (DFG) in Project Crossmodal Learning, NSFC 61621136008 / DFC TRR-169. Hao Zhu is supported by Tsinghua University Initiative Scientific Research Program. We thank all annotators for their hard work. We also thank all members from Tsinghua NLP Lab for their strong support for annotator recruitment.
- Bengio (2012) Yoshua Bengio. 2012. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML.
Carlson et al. (2009)
Andrew Carlson, Justin Betteridge, Estevam R Hruschka Jr, and Tom M Mitchell.
Coupling semi-supervised learning of categories and relations.In
Proceedings of the NAACL HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pages 1–9. Association for Computational Linguistics.
Rich Caruana. 1995.
Learning many related tasks at the same time with backpropagation.In Proceedings of NIPS.
- Donahue et al. (2014) Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of ICML.
- Feng et al. (2018) Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement learning for relation classification from noisy data. In Proceedings of AAAI.
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of ICML.
- Gormley et al. (2015) Matthew R. Gormley, Mo Yu, and Mark Dredze. 2015. Improved relation extraction with feature-rich compositional embedding models. In Proceedings of EMNLP.
Hendrickx et al. (2009)
Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009.Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of SemEval@ACL.
- Hoffmann et al. (2011) Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL.
- Huang and Wang (2017) Yi Yao Huang and William Yang Wang. 2017. Deep residual learning for weakly-supervised relation extraction. In Proceedings of EMNLP.
- Ji et al. (2017) Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, et al. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of AAAI.
- Koch et al. (2015) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In Proceedings of ICML.
- Lake et al. (2015) Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science.
- Lin et al. (2016) Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of ACL.
- Liu et al. (2017) Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of EMNLP.
- Luo et al. (2017) Bingfeng Luo, Yansong Feng, Zheng Wang, Zhanxing Zhu, Songfang Huang, Rui Yan, and Dongyan Zhao. 2017. Learning with noise: Enhance distantly supervised relation extraction with dynamic transition matrix. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 430–439.
- Ma et al. (2016) Yukun Ma, Erik Cambria, and Sa Gao. 2016. Label embedding for zero-shot fine-grained named entity typing. In COLING.
- (18) Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP.
- Mishra et al. (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2018. A simple neural attentive meta-learner. In Proceedings of ICLR.
- Mooney and Bunescu (2006) Raymond J Mooney and Razvan C Bunescu. 2006. Subsequence kernels for relation extraction. In Proceedings of NIPS.
- Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. 2017. Meta networks. In Proceedings of ICML.
- Randolph (2005) Justus J Randolph. 2005. Free-marginal multirater kappa (multirater free): an alternative to fleiss’ fixed-marginal multirater kappa. In Proceedings of JLIS.
- Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In Proceedings of ICLR.
- (24) Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text. In Proceedings of ECML-PKDD.
- Riedel et al. (2010) Sebastian Riedel, Limin Yao, and Andrew D McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of ECML-PKDD.
- Roth et al. (2013) Benjamin Roth, Tassilo Barth, Michael Wiegand, and Dietrich Klakow. 2013. A survey of noise reduction methods for distant supervision. In Proceedings of the 2013 workshop on Automated knowledge base construction, pages 73–78. ACM.
- Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of ICML.
- Satorras and Estrach (2018) Victor Garcia Satorras and Joan Bruna Estrach. 2018. Few-shot learning with graph neural networks. In Proceedings of ICLR.
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical networks for few-shot learning. In Proceedings of NIPS.
- Strassel et al. (2008) Stephanie Strassel, Mark A. Przybocki, Kay Peterson, Zhiyi Song, and Kazuaki Maeda. 2008. Linguistic resources and evaluation techniques for evaluation of cross-document automatic content extraction. In Proceedings of LREC.
- Surdeanu et al. (2012) Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of EMNLP.
- Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In Proceedings of NIPS.
- Wu et al. (2017) Yi Wu, David Bamman, and Stuart Russell. 2017. Adversarial training for relation extraction. In Proceedings of EMNLP.
Xie et al. (2016)
Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016.
Representation learning of knowledge graphs with entity descriptions.In AAAI.
- Xin et al. (2018) Ji Xin, Hao Zhu, Xu Han, Zhiyuan Liu, and Maosong Sun. 2018. Put it back: Entity typing with language model enhancement. In Proceedings of EMNLP.
- Yu et al. (2018) Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1206–1215.
- Zelenko et al. (2002) Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2002. Kernel methods for relation extraction. JMLR.
- Zeng et al. (2015) Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP.
- Zeng et al. (2014) Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING.
- Zeng et al. (2017) Wenyuan Zeng, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. Incorporating relation paths in neural relation extraction. In Proceedings of EMNLP.
- Zeng et al. (2018) Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Large scaled relation extraction with reinforcement learning. In Proceedings of AAAI.
- Zhang et al. (2017) Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of EMNLP.