Accepted as a long paper in EMNLP 2021 (Conference on Empirical Methods in Natural Language Processing).
Relation extraction aims to detect the relation between two entities in a sentence and is the cornerstone of various natural language processing (NLP) applications, including knowledge base enrichment trisedya-etal-2019-neural, biomedical knowledge discovery DBLP:conf/ijcai/GuoN0C20, and question answering DBLP:conf/ijcai/HanCW20. Conventional neural methods miwa-bansal-2016-end; tran-etal-2019-relation train a deep network on a large amount of labeled data covering extensive relations, so that the model can recognize these relations at test time. Although impressive performance has been achieved, such methods are difficult to adapt to novel relations never seen during training. In contrast, humans can identify new relations from very few examples. It is thus of great interest to enable models to generalize to new relations with only a handful of labeled instances.
Inspired by the success of few-shot learning in the computer vision (CV) community DBLP:conf/cvpr/SungYZXTH18; DBLP:conf/iclr/SatorrasE18, han-etal-2018-fewrel first introduce the task of few-shot relation extraction (FSRE). FSRE requires models to handle the classification of novel relations with scarce labeled instances. A popular framework for few-shot learning is meta-learning DBLP:conf/icml/SantoroBBWL16; DBLP:conf/nips/VinyalsBLKW16, which optimizes the model over collections of few-shot tasks sampled from external data whose relations are disjoint from the novel relations, so that the model learns cross-task knowledge and can adapt rapidly to new tasks. A simple yet effective meta-learning algorithm is the prototypical network DBLP:conf/nips/SnellSZ17, which learns a metric space in which a query instance is classified according to its distance to class prototypes. Recently, many FSRE works DBLP:conf/aaai/GaoH0S19; DBLP:conf/icml/QuGXT20; DBLP:conf/cikm/YangZDHHC20 follow this prototypical line and achieve remarkable performance. Nonetheless, the difficulty of distinguishing relations varies across tasks DBLP:journals/corr/abs-2007-06240, depending on the similarity between relations. As illustrated in Figure LABEL:intro, there are easy few-shot tasks whose relations are quite different and can thus be consistently well classified, and hard few-shot tasks with subtle inter-relation variations that are prone to misclassification. Current FSRE methods struggle with hard tasks given limited labeled instances for two main reasons. First, most works focus on general tasks to learn generalized representations and do not effectively model the subtle, local differences between relations, which may hinder them from handling hard tasks well. Second, current meta-learning methods treat training tasks equally, although tasks are randomly sampled and have different degrees of difficulty; the generated easy tasks can overwhelm training and lead to a degenerate model.
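As a concrete illustration, the prototype-based classification rule of prototypical networks can be sketched as follows (a minimal sketch; tensor names and dimensions are illustrative):

```python
import torch

def prototypical_logits(support: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """Score queries by negative squared Euclidean distance to class prototypes.

    support: [N, K, d] embeddings of K labeled instances for each of N classes.
    query:   [Q, d] embeddings of query instances.
    Returns: [Q, N] logits; a softmax over the last dim gives class probabilities.
    """
    prototypes = support.mean(dim=1)              # [N, d] mean of each class's shots
    dists = torch.cdist(query, prototypes) ** 2   # [Q, N] squared distances
    return -dists                                 # closer prototype => larger logit
```

A query instance is assigned to the class whose prototype is nearest in the learned metric space.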
To fill this gap, this paper proposes a Hybrid Contrastive Relation-Prototype (HCRP) approach, which focuses on improving performance on hard FSRE tasks. Concretely, we first propose a hybrid prototypical network capable of capturing global and local features to generate informative class prototypes. Next, we present a novel relation-prototype contrastive learning method, which leverages relation descriptions as anchors, pulling the prototype of the same class closer in the representation space and pushing those of different classes away. In this way, the model gains diverse and discriminative prototype representations, which helps distinguish the subtle differences among confusing relations in hard few-shot tasks. Furthermore, we design a task-adaptive training strategy based on focal loss (DBLP:conf/iccv/LinGGHD17), which allocates dynamic weights to tasks according to their difficulty so that the model learns more from hard tasks. Extensive experiments on two large-scale benchmarks show that our model significantly outperforms the baselines. Ablation and case studies demonstrate the effectiveness of the proposed modules. Our code is available at https://github.com/hanjiale/HCRP .
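For reference, the standard focal loss that the task-adaptive strategy builds on can be sketched as follows (a minimal sketch of the original formulation; the task-level weighting used in HCRP is described later, and names here are illustrative):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss (Lin et al., 2017): scales cross-entropy by (1 - p_t)^gamma,
    down-weighting well-classified (easy) examples so hard ones dominate training."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    return -((1 - pt) ** gamma * log_pt).mean()
```

With gamma = 0 this reduces to standard cross-entropy; larger gamma focuses the loss more sharply on hard examples.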
The contributions of this paper are summarized as follows:
We present HCRP to exploit task difficulty as useful information for FSRE, which boosts a hybrid prototypical network with relation-prototype contrastive learning to capture diverse and discriminative representations.
We design a novel task-adaptive focal loss to focus training on hard tasks, which enables the model to achieve higher robustness and better performance.
Qualitative and quantitative experiments on two FSRE benchmarks demonstrate the effectiveness of our model.
2 Related Work
2.1 Few-shot Relation Extraction
Relation extraction is a foundational task in NLP and has attracted much recent attention DBLP:journals/corr/abs-2104-07650; nan-etal-2020-reasoning; nan2021dialogue. Few-shot relation extraction aims to predict novel relations from only a few labeled instances. han-etal-2018-fewrel first present FewRel, a large-scale benchmark for FSRE. DBLP:conf/aaai/GaoH0S19 design a hybrid attention-based prototypical network to highlight crucial instances and features. ye-ling-2019-multi propose a prototypical network with multi-level matching and aggregation. sun-etal-2019-hierarchical present a hierarchical attention prototypical network to enhance the representation ability of the semantic space. DBLP:conf/icml/QuGXT20 utilize an external relation graph to study the relationships between different relations. wang-etal-2020-learning add relative position and syntactic relation information to enhance prototypical networks. DBLP:conf/cikm/YangZDHHC20 fuse text descriptions of relations and entities via a collaborative attention mechanism, and yang-etal-2021-entity introduce the inherent concepts of entities to provide clues for relation classification. Some methods baldini-soares-etal-2019-matching; peng-etal-2020-learning combine prototypical networks with pre-trained language models and achieve impressive results. However, task difficulty has not been explored for FSRE. In this work, we focus on hard tasks and propose a hybrid contrastive relation-prototype method to better model subtle variations across relations.
2.2 Contrastive Learning
Contrastive learning DBLP:journals/corr/abs-2011-00362 has gained popularity recently in the CV community. The core idea is to contrast the similarities and dissimilarities between data instances, pulling positives closer while pushing negatives away. CPC DBLP:journals/corr/abs-1807-03748 proposes a universal unsupervised learning approach. MoCo DBLP:conf/cvpr/He0WXG20 presents a mechanism for building dynamic dictionaries for contrastive learning. SimCLR DBLP:conf/icml/ChenK0H20 improves contrastive learning with larger batch sizes and data augmentation. DBLP:conf/nips/KhoslaTWSTIMLK20 extend the self-supervised contrastive approach to the supervised setting. Nan_2021_CVPR propose a dual contrastive learning approach for video grounding. Contrastive learning has also been applied in NLP: DBLP:journals/corr/abs-2005-12766 employ back-translation and MoCo to learn sentence-level representations, and gunel2021supervised design supervised contrastive learning for fine-tuning pre-trained language models. Inspired by these works, we propose heterogeneous relation-prototype contrastive learning in a supervised manner to obtain more discriminative representations.
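The pull-together/push-apart objective can be sketched with a simple supervised contrastive loss in the spirit of DBLP:conf/nips/KhoslaTWSTIMLK20 (a minimal sketch; names and the masking scheme are illustrative):

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull embeddings with the same label together, push different labels apart.
    features: [B, d] embeddings, labels: [B] integer class labels."""
    z = F.normalize(features, dim=-1)                    # unit-norm embeddings
    sim = z @ z.t() / temperature                        # [B, B] scaled cosine similarity
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # log-probability of each other example being picked as the "match" for the anchor
    log_prob = F.log_softmax(sim.masked_fill(self_mask, float('-inf')), dim=1)
    pos_count = pos_mask.sum(dim=1)
    valid = pos_count > 0                                # anchors with >= 1 positive
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1)[valid] / pos_count[valid]
    return loss.mean()
```

Minimizing this loss makes same-class pairs dominate each anchor's softmax over the batch, which is exactly the pull/push behavior described above.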
3 Task Definition
We follow the typical few-shot task setting, namely the N-way-K-shot setup, which contains a support set S and a query set Q. The support set S includes N novel classes, each with K labeled instances. The query set Q contains the same N classes as S, and the task is evaluated on Q, i.e., predicting the relations of the instances in Q. In addition, an auxiliary dataset is given, which contains abundant base classes, each with a large number of labeled examples. Note that the base classes and novel classes are disjoint. The few-shot learner aims to acquire knowledge from the base classes and use it to recognize novel classes. One popular approach is the meta-learning paradigm DBLP:conf/nips/VinyalsBLKW16, which mimics the few-shot setting at training time. Specifically, in each training iteration, we randomly select N base classes, each with K instances, to form a support set S. Meanwhile, query instances are sampled from the remaining data of the N classes to construct a query set Q. The model is optimized over collections of such few-shot tasks sampled from the base classes, so that it can rapidly adapt to new tasks.
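The episode construction described above can be sketched as follows (a minimal sketch; the dataset layout and names are illustrative assumptions):

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query):
    """Sample one N-way K-shot episode from `dataset`, a dict mapping each
    base relation to its list of labeled instances."""
    relations = random.sample(sorted(dataset), n_way)            # pick N base classes
    support, query = [], []
    for label, rel in enumerate(relations):
        picked = random.sample(dataset[rel], k_shot + n_query)   # no overlap possible
        support += [(inst, label) for inst in picked[:k_shot]]   # K shots per class
        query += [(inst, label) for inst in picked[k_shot:]]     # queries from the rest
    return support, query
```

Because support and query instances are drawn from the same sample without replacement, the query set never leaks support instances.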
For an FSRE task, each instance is a tuple (x, e, y), where x denotes a natural language sentence, e indicates a pair of head and tail entities, and y is the relation label. The name and description of each relation are also provided as auxiliary support evidence for relation extraction. For example, for a relation with its relation id "P726" in a dataset that we use, we can obtain its name "candidate" and description "person or party that is an option for an office in an election".
In this section, we present the details of our proposed HCRP approach. The overall learning framework is illustrated in Figure 2. The inputs are N-way-K-shot tasks sampled from the auxiliary dataset, where each task contains a support set S and a query set Q. Meanwhile, we take the names and descriptions of these N classes (i.e., relations) as inputs as well. HCRP consists of three components. The hybrid prototype learning module generates informative prototypes from global and local features, which better capture the subtle differences between relations. The relation-prototype contrastive learning component then leverages relation label information to further enhance the discriminative power of the prototype representations. Finally, a task-adaptive focal loss encourages the model to focus training on hard tasks.
4.1 Hybrid Prototype Learning
We employ BERT devlin-etal-2019-bert as the encoder to obtain contextualized embeddings $Q_j \in \mathbb{R}^{n_j \times d}$ of query instances and $S_i^k \in \mathbb{R}^{m_{i,k} \times d}$ of support instances, where $n_j$ and $m_{i,k}$ are the sentence lengths of the $j$-th query instance and the $k$-th support instance in class $i$ respectively, and $d$ is the size of the resulting contextualized representations. For each relation $i$, we concatenate its name and description and feed the sequence into the BERT encoder to obtain relation embeddings $R_i \in \mathbb{R}^{l_i \times d}$, where $l_i$ is the length of relation description $i$.
For instances in $S$ and $Q$, the global features are obtained by concatenating the hidden states corresponding to the start tokens of the two entity mentions, following baldini-soares-etal-2019-matching. The global feature of each relation is taken from the hidden state of its [CLS] token (converted to the same dimension with a transformation). For each relation $i$, we average the global features of its $K$ support instances following DBLP:conf/nips/SnellSZ17, and further add the global feature of relation $i$ to form the global prototype representation.
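The global-feature and prototype construction above can be sketched as follows (a minimal sketch; tensor names, shapes, and the 2d projection are illustrative assumptions):

```python
import torch

def entity_marker_feature(hidden: torch.Tensor, head_start: int, tail_start: int) -> torch.Tensor:
    """Global instance feature: concatenate the hidden states at the start
    tokens of the head and tail entity mentions.  hidden: [L, d] -> [2d]."""
    return torch.cat([hidden[head_start], hidden[tail_start]], dim=-1)

def global_prototypes(support_feats: torch.Tensor, relation_feats: torch.Tensor) -> torch.Tensor:
    """support_feats: [N, K, 2d] entity-marker features of the support instances.
    relation_feats:  [N, 2d] relation features ([CLS] state projected to 2d).
    Prototype = mean over the K shots plus the relation feature."""
    return support_feats.mean(dim=1) + relation_feats
```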
While global prototypes are capable of capturing general data representations, they may not readily capture useful local information within specific FSRE tasks. To better handle hard FSRE tasks with subtle differences among highly similar relations, we further propose local prototypes that highlight the key tokens in an instance that are essential to characterize different relations.
For relation $i$, we first calculate the local feature $\tilde{s}_i^k$ of the $k$-th support instance as:
$$A = S_i^k R_i^\top,\qquad \alpha = \mathrm{softmax}\big(\mathrm{sum}(A)\big),\qquad \tilde{s}_i^k = \textstyle\sum_m \alpha_m S_{i,m}^k,$$
where $S_{i,m}^k$ is the $m$-th row of the matrix $S_i^k$, and $\mathrm{sum}(\cdot)$ is an operation that sums all elements for each row in a matrix.
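The attention computation described above can be sketched as follows (a minimal sketch consistent with the description; tensor names are illustrative):

```python
import torch

def local_feature(instance_tokens: torch.Tensor, relation_tokens: torch.Tensor) -> torch.Tensor:
    """instance_tokens: [m, d] token embeddings of one support instance.
    relation_tokens:  [l, d] token embeddings of the relation description.
    Each instance token is weighted by its total similarity to the relation tokens."""
    A = instance_tokens @ relation_tokens.t()     # [m, l] token-level similarities
    alpha = torch.softmax(A.sum(dim=1), dim=0)    # [m] row-sums, then normalize
    return alpha @ instance_tokens                # [d] attention-weighted local feature
```

Tokens that are most similar to the relation description receive the largest weights, which is what lets local features emphasize relation-specific cues in the sentence.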