
Function-words Enhanced Attention Networks for Few-Shot Inverse Relation Classification

Relation classification is the task of identifying semantic relations between two entities in a given text. While existing models perform well at classifying inverse relations with large datasets, their performance is significantly reduced in few-shot learning. In this paper, we propose a function-words adaptively enhanced attention framework (FAEA) for few-shot inverse relation classification, in which a hybrid attention model is designed to attend to class-related function words based on meta-learning. As the involvement of function words brings significant intra-class redundancy, an adaptive message passing mechanism is introduced to capture and transfer inter-class differences. We mathematically analyze the negative impact of function words from the dot-product measurement, which explains why the message passing mechanism effectively reduces this impact. Our experimental results show that FAEA outperforms strong baselines; in particular, inverse relation accuracy is improved by 14.33%.


1 Introduction

Figure 1: The left figure shows the word attention visualization of TD-Proto, where a darker unit indicates a higher value. We observe that the content-word region is much darker than the function-word region. The right figure is an example of FSIRC. This is a 2-way-1-shot setup: each task involves two relations, and each relation has one support instance. Head and tail entities are indicated in each instance.

Relation classification (RC) aims to classify the relation between two given entities based on their related context. Specifically, given a sentence in a natural language, a set of relation names, and two entities, we want to determine the correct relation between these two entities. RC is useful for many natural language processing applications, such as information retrieval [Bao et al.2020] and question answering [Ma et al.2021]. Most existing approaches to RC are based on supervised learning, and the datasets used in training heavily depend on human-annotated data, which limits their performance on classifying relations with insufficient instances. Therefore, making RC models capable of handling relations with few training instances becomes a crucial challenge in AI. Inspired by the success of few-shot learning methods in the computer vision community [Wu et al.2021, Yang et al.2021a], Han et al. [Han et al.2018] first investigated the problem of few-shot relation classification (FSRC) and proposed the FewRel1.0 dataset for evaluating the performance of FSRC models. Since then, several other FSRC models have been reported in the literature, and they demonstrate remarkable performance on FewRel1.0 [Gao et al.2019a, Qu et al.2020, Yang et al.2021b].

However, our experiments show that the performance of existing models for FSRC is significantly reduced when one relation is the inverse of another relation in the given set of relations. Figure 1 shows a few-shot inverse relation classification (FSIRC) task where the set of relations contains the two relations ‘participant’ and ‘participant of’. From the two support instances, we can see that the relation ‘participant’ is the inverse of ‘participant of’. We note that ‘participant’ and ‘participant of’ share the same content word ‘participation’ but differ in the function words ‘of’ and ‘his’. Existing models [Sun et al.2019, Bao et al.2020] for FSRC focus on characterising the differences between content words but ignore the differences between function words. As a result, these models do not perform well when one relation is the inverse of another relation.

In practical applications, it is often the case that one relation is the inverse of another: for example, we found that 21.25% of the relations in the FewRel1.0 dataset are inverse. Classifying relations in the presence of inverse relations is useful but challenging in few-shot scenarios, as the lack of sufficient samples makes it hard to determine the importance distribution of class-related function words.

In order to address this issue, we propose a new approach to FSIRC, called Function-words Adaptively Enhanced Attention Networks (FAEA). As shown in Figure 2, in the instance encoder, besides considering the importance of keywords, we use a hybrid attention to capture function words. Specifically, the class-general attention mechanism learns a general function-word importance distribution. As function words appearing in the same phrase as keywords are more likely to be informative [Zhang et al.2020], we design a class-specific attention that strengthens the importance of function words adjacent to keywords. In our experience, function words far from keywords are sometimes also important. For this reason, we introduce a semantic-related attention that computes the direct semantic relevance between function words and keywords. However, the introduction of function words may increase intra-class differences. We therefore present a message passing mechanism to capture and transfer inter-class differences and intra-class commonalities between instances. But when inter-class differences are large, they bring in noise and useful relation semantics can be lost. To avoid this issue, we adaptively control the proportion of the transferred inter-class message. Our experiments show that FAEA significantly outperforms major baseline models for FSIRC.

In a nutshell, our contributions are listed as follows:

  • We present FAEA, which learns to capture class-related function words and introduces an adaptive message passing mechanism, yielding phrase-level local features and thus discriminative representations.

  • We mathematically show that the involvement of function words increases intra-class differences under the dot-product measurement, and that the designed message passing mechanism effectively reduces this redundancy.

  • We conduct experiments on two benchmarks and show that our model significantly outperforms the baselines, especially on the FSIRC task. Ablation experiments demonstrate the effectiveness of the proposed modules.

Figure 2: The overall framework of FAEA. The input of the Instance Encoder is an instance with its corresponding relation description.

2 Related Work

Few-shot relation classification aims to predict novel relations by exploring a few labeled instances. Existing methods can be mainly divided into two categories: gradient-based models and metric-based models. A gradient-based model [Finn et al.2017, Qu et al.2020] rapidly adapts to a given task via a small number of gradient update steps. MAML [Finn et al.2017] is a representative model: it learns a suitable initialization of model parameters from base classes and transfers these parameters to novel classes in a few gradient steps. Metric-based models [Snell et al.2017, Gao et al.2019a, Ye and Ling2019, Wen et al.2021] leverage similarity information between samples to identify novel classes with few samples. As a representative model, the Prototypical Network [Snell et al.2017] takes the mean of support samples as the class prototype. Some models [Ye and Ling2019, Wen et al.2021] add attention mechanisms to the prototypical network to highlight crucial instances and features, but they ignore intra-sentence difference information. Other models [Sun et al.2019, Bao et al.2020, Yang et al.2021b] capture local content words to obtain fine-grained information but ignore the importance of function words. However, inverse relations in FSRC have not been effectively handled by current models. In this work, we focus on inverse relations and propose a hybrid function-word attention method to model subtle variations across inverse relations.

3 Our Method

3.1 Problem Statement

FSRC is defined as the task of predicting the relation between the entity pair $(e_h, e_t)$ mentioned in a query instance $q$ (i.e., a sentence containing the entities $e_h$ and $e_t$), given a support set $\mathcal{S} = \{(x_i^k, r_i) \mid i = 1, \dots, N;\ k = 1, \dots, K\}$ and a relation set $\mathcal{R} = \{(r_i, d_i)\}_{i=1}^{N}$, where $(x_i^k, r_i)$ means there is a relation $r_i$ between the entity pair in instance $x_i^k$, and $d_i$ is the corresponding relation description. $N$ is the number of relations, and each relation has only a small number $K$ of labeled instances. For an FSIRC task, the relation set includes some pairs of inverse relations. For example, ‘participant’ and ‘participant of’ are inverse relations, and their relation descriptions are “person or organization that actively takes part in the event” and “event a person or an organization was a participant in”, respectively.

3.2 Overall Framework

As shown in Figure 2, our model consists of three parts:

  • Instance Encoder: Given an instance and its entity pair, we employ an instance-level global encoder and a phrase-level local encoder to encode the instance into an embedding.

  • Function-words Enhanced Attention: The phrase-level local encoder utilizes function-words enhanced attention to capture important function words in instances.

  • Adaptive Message Passing: After computing embeddings, we transfer commonalities between same-class instances and differences between different-class instances.

3.3 Instance Encoder

Given an instance $x = \{w_1, \dots, w_n\}$ mentioning two entities and containing $n$ words, we employ BERT [Devlin et al.2019] as the encoder to obtain the corresponding embeddings $H = [h_1, \dots, h_n]$, where each word embedding $h_j \in \mathbb{R}^{d}$ and $d$ is the embedding dimension. For the $i$-th relation $r_i$, we feed its name and description into BERT to obtain the relation word embeddings, and the relation feature is obtained from the hidden state corresponding to the [CLS] token (converted to dimension $d$ with a transformation).

For each instance $x_i^k$ in $\mathcal{S}$ and the query instance $q$, our model generates global instance embeddings and local phrase embeddings to form the hybrid instance embeddings $\tilde{x}_i^k$ and $\tilde{q}$. The following takes $x_i^k$ as an example.

Instance-level Global Encoder

The global features are obtained by concatenating the hidden states corresponding to the start tokens of the two entity mentions, following [Soares et al.2019].
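As a concrete illustration of this global encoder, the following PyTorch sketch encodes an instance with BERT and concatenates the hidden states at the two entity-start positions. The entity marker tokens, the example sentence, and all variable names are our own assumptions for illustration, not the authors' released implementation.

    # Minimal sketch of the instance-level global encoder, assuming the
    # entity-start marker scheme of [Soares et al.2019]; markers and names
    # are illustrative assumptions.
    import torch
    from transformers import BertModel, BertTokenizer

    MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]"]
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased",
                                              additional_special_tokens=MARKERS)
    encoder = BertModel.from_pretrained("bert-base-uncased")
    encoder.resize_token_embeddings(len(tokenizer))  # room for the new markers

    sentence = "[E1] London [/E1] is the capital of [E2] England [/E2] ."
    batch = tokenizer(sentence, return_tensors="pt", max_length=128, truncation=True)

    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state.squeeze(0)  # (seq_len, 768)

    ids = batch["input_ids"].squeeze(0).tolist()
    h_start = ids.index(tokenizer.convert_tokens_to_ids("[E1]"))
    t_start = ids.index(tokenizer.convert_tokens_to_ids("[E2]"))

    # Global instance embedding: concatenation of the two entity-start states.
    global_feature = torch.cat([hidden[h_start], hidden[t_start]], dim=-1)  # (1536,)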

Phrase-level Local Encoder

The main process consists of learning a keyword attention vector $\beta$ and a function-word attention vector $\gamma$. $\beta$ can be computed as follows:

(1)

where the memory unit $m$ is a trainable parameter that helps to select general keywords from instances.
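Because Eq. (1) itself is not reproduced here, the following sketch shows only one plausible form of the keyword attention: a trainable memory vector scores every word and a softmax turns the scores into the attention vector $\beta$. It is an assumption for illustration, not the paper's exact formula.

    # Plausible sketch of the keyword attention around Eq. (1); the exact
    # formula is not recoverable from this text.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class KeywordAttention(nn.Module):
        def __init__(self, dim=768):
            super().__init__()
            self.memory = nn.Parameter(torch.randn(dim) * 0.02)  # memory unit m

        def forward(self, H):                  # H: (seq_len, dim) word embeddings
            scores = H @ self.memory           # one score per word
            return F.softmax(scores, dim=-1)   # keyword attention vector beta

    H = torch.randn(12, 768)                   # embeddings of a 12-word instance
    beta = KeywordAttention()(H)               # non-negative weights summing to 1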

3.4 Function-words Enhanced Attention

We utilize class-general attention to learn a general function-word importance distribution, and leverage class-specific attention, consisting of constituent attention and semantic-related attention, to estimate class-specific importance.

Class-general attention

We downweigh the importance of words related to the keywords and upweigh the importance of words unrelated to the keywords to obtain the general function-word importance $\gamma_g$, where $J$ is an all-one matrix:

(2)
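Since Eq. (2) is not reproduced above, the snippet below illustrates one simple reading of the description: invert the keyword attention against an all-one vector and renormalise, so that words with low keyword weight (typically function words) receive higher general importance. This is an assumption, not the authors' exact formulation.

    # Illustrative class-general attention: downweigh keyword-related words,
    # upweigh the rest (one possible reading of Eq. (2)).
    import torch
    import torch.nn.functional as F

    beta = torch.tensor([0.02, 0.40, 0.03, 0.35, 0.05, 0.15])  # keyword attention
    ones = torch.ones_like(beta)                               # all-one matrix J
    gamma_general = F.softmax(ones - beta, dim=-1)             # class-general weights
    print(gamma_general)  # function words (low beta) now receive larger weights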

Class-specific attention

Considering that function-word importance varies by class, we learn a constituent prior matrix $C$ and a semantic-related matrix $S$, and use them to strengthen the attention of function words adjacent to keywords.

The element $C_{j,j+1}$ of $C$, meaning the probability that $w_j$ and $w_{j+1}$ of an instance belong to the same phrase, is obtained as follows, where $H_j$ is the $j$-th row of the matrix $H$:

(3)
(4)
(5)

We compute the score representing the tendency that $w_j$ and its right neighbor belong to the same phrase by scaled dot-product attention. We constrain each word to belong to either its right neighbor or its left neighbor, which is implemented by applying a softmax function over the two attention links of $w_j$. As $C_{j,j+1}$ and $C_{j+1,j}$ may have different values, we average the two attention links.
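The constituent prior described above can be sketched as follows: neighbouring words are scored with scaled dot products, a softmax over each word's two links decides whether it leans left or right, and the two directed links are averaged. The parameter-free scoring is a simplification we assume for brevity; the paper may use learned projections.

    # Sketch of the constituent attention (Eqs. (3)-(5)); parameter-free here.
    import torch
    import torch.nn.functional as F

    def constituent_prior(H):
        n, d = H.shape
        right = (H[:-1] * H[1:]).sum(-1) / d ** 0.5   # score of w_j with its right neighbour
        C = torch.zeros(n, n)
        for j in range(n):
            links = []
            if j > 0:
                links.append(right[j - 1])            # link to the left neighbour
            if j < n - 1:
                links.append(right[j])                # link to the right neighbour
            probs = F.softmax(torch.stack(links), dim=0)  # belong to left OR right
            if j > 0:
                C[j, j - 1] = probs[0]
            if j < n - 1:
                C[j, j + 1] = probs[-1]
        return (C + C.T) / 2                          # average the two directed links

    C = constituent_prior(torch.randn(8, 768))        # toy 8-word instance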

Then we use a self-attention mechanism to obtain a semantic-related matrix $S$ that attends to necessary function words far from keywords:

(6)

Next, we find the keyword indices in $\beta$ and strengthen the related function words in the matrices $C$ and $S$ according to these indices; a top-$m$ operator is used to get the indices of the $m$ keywords with the largest attention, where $m$ is the number of keywords:

(7)

Finally, the model uses $\gamma_g$, $C$ and $S$ to form the hybrid function-word attention vector $\gamma$, where $\lambda$ is a hyper-parameter:

(8)
(9)
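Putting the pieces together, the sketch below computes a semantic-related matrix with plain self-attention, picks the top-m keyword positions from the keyword attention, and mixes the resulting function-word signal with the class-general term. Equations (6)-(9) are not reproduced above, so the combination rule and the names (lam for the hyper-parameter $\lambda$) are illustrative assumptions.

    # Illustrative hybrid function-word attention (around Eqs. (6)-(9)).
    import torch
    import torch.nn.functional as F

    def hybrid_function_word_attention(H, beta, gamma_general, C, m=2, lam=0.5):
        d = H.shape[-1]
        S = F.softmax(H @ H.T / d ** 0.5, dim=-1)       # semantic-related matrix
        top_idx = beta.topk(m).indices                  # top-m keyword positions
        strengthen = (C[top_idx] + S[top_idx]).mean(0)  # function words tied to keywords
        gamma = lam * gamma_general + (1 - lam) * strengthen
        return F.softmax(gamma, dim=-1)                 # hybrid attention vector

    H = torch.randn(6, 768)
    beta = F.softmax(torch.randn(6), dim=-1)
    gamma_general = F.softmax(1 - beta, dim=-1)
    C = torch.rand(6, 6)
    gamma = hybrid_function_word_attention(H, beta, gamma_general, C)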

All in all, inspired by MAML [Finn et al.2017], which learns general model parameters and fine-tunes them to adapt to a specific class, we design class-general and class-specific attention to learn function-word variance in the few-shot setting.

3.5 Adaptive Message Passing

The message passing module is used to reduce intra-class redundancy and adaptively control the proportion of the transferred inter-class message. Firstly, we construct a directed graph $G = (V, A)$, where $V$ is the set of instance features, $A$ is the adjacency matrix, $\|$ denotes row-wise concatenation, and $A_i$ denotes the $i$-th row of $A$:

(10)

We design a new node-updating rule that captures and transfers inter-class differences and intra-class commonalities between instance nodes, where $\mathcal{N}_d(i)$ and $\mathcal{N}_s(i)$ denote the different-class and same-class neighbor sets of node $i$, respectively, and $|\mathcal{N}(i)|$ is the degree of instance node $i$:

(11)
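Because Eq. (11) is not reproduced above, the following sketch only mirrors the description in words: each node absorbs the mean of its same-class neighbours and is pushed away from the mean of its different-class neighbours, with a scalar alpha standing in for the adaptive proportion of the transferred inter-class message. The exact update rule in the paper may differ.

    # Sketch of the adaptive message passing node update (around Eq. (11)).
    import torch

    def adaptive_message_passing(V, labels, alpha=0.5):
        V_new = V.clone()
        for i in range(V.shape[0]):
            same = (labels == labels[i]).nonzero(as_tuple=True)[0]
            same = same[same != i]
            diff = (labels != labels[i]).nonzero(as_tuple=True)[0]
            msg = torch.zeros_like(V[i])
            if len(same) > 0:
                msg = msg + V[same].mean(0)          # intra-class commonality
            if len(diff) > 0:
                msg = msg - alpha * V[diff].mean(0)  # scaled inter-class difference
            V_new[i] = V[i] + msg
        return V_new

    V = torch.randn(4, 768)                          # 2-way 2-shot toy support set
    labels = torch.tensor([0, 0, 1, 1])
    V_updated = adaptive_message_passing(V, labels)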

For the $i$-th relation, we average the supporting features to form the prototype representation, following [Snell et al.2017]:

(12)

With the relation prototypes, the model computes the probability of each relation for the query instance as follows:

(13)

The final objective function is the cross-entropy loss $\mathcal{L} = -\sum_{i} y_i \log p_i$, where $y_i$ is the class label and $p_i$ is the estimated probability for class $i$.
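The prototype construction, query scoring, and training objective can be summarised with the short sketch below. The dot-product scoring follows the paper's dot-product measurement, but the concrete function and tensor names are our reading of Eqs. (12)-(13) rather than the released code.

    # Sketch of prototype classification and the cross-entropy objective.
    import torch
    import torch.nn.functional as F

    def classify_query(support, support_labels, query, n_way):
        protos = torch.stack([support[support_labels == r].mean(0)
                              for r in range(n_way)])     # Eq. (12): class prototypes
        logits = query @ protos.T                         # dot-product similarity
        return F.log_softmax(logits, dim=-1)              # Eq. (13): log p(r | query)

    support = torch.randn(10, 768)                        # 5-way 2-shot toy support
    support_labels = torch.arange(5).repeat_interleave(2)
    query = torch.randn(3, 768)
    log_probs = classify_query(support, support_labels, query, n_way=5)
    loss = F.nll_loss(log_probs, torch.tensor([0, 3, 4])) # cross-entropy objective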

In short, we design a new node-updating method that additionally captures inter-class differences.

3.6 Theoretical Analysis

In our theoretical analysis, we show that the involvement of function words increases intra-class differences (Theorem 1) and that the designed message passing mechanism makes different-class nodes discriminative and same-class nodes similar (Theorem 2). Detailed proofs are given in the Appendix.

Given any two instances $x_1$ and $x_2$, let the corresponding keyword representations be $k_1$ and $k_2$, the function-word representations be $f_1$ and $f_2$, and the instance representations considering function words be $s_1$ and $s_2$.

Theorem 1.

Let

(14)

where $\langle \cdot , \cdot \rangle$ indicates the inner product of vectors.

If the condition in Eq. (14) is satisfied, then

(15)

This theorem shows that when function words are taken into account, the similarity degree between instances $x_1$ and $x_2$ becomes smaller, and thus the intra-class difference is increased.
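As a toy numeric illustration of this effect (not a substitute for the proof), assume two same-class instances whose keyword vectors are similar but whose function-word vectors are not; concatenating the function-word part lowers the normalised dot-product similarity.

    # Toy illustration of the intuition behind Theorem 1.
    import torch
    import torch.nn.functional as F

    k1 = F.normalize(torch.tensor([1.0, 0.2, 0.1]), dim=0)   # keywords, instance 1
    k2 = F.normalize(torch.tensor([0.9, 0.3, 0.1]), dim=0)   # keywords, instance 2
    f1 = F.normalize(torch.tensor([0.1, 1.0, -0.5]), dim=0)  # function words, instance 1
    f2 = F.normalize(torch.tensor([-0.6, 0.2, 1.0]), dim=0)  # function words, instance 2

    s_keyword = torch.dot(k1, k2)                            # about 0.99
    s_hybrid = torch.dot(F.normalize(torch.cat([k1, f1]), dim=0),
                         F.normalize(torch.cat([k2, f2]), dim=0))  # about 0.36
    print(s_keyword.item(), s_hybrid.item())                 # hybrid similarity is smaller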

Theorem 2.

Given the function-word representations $f_1$ and $f_2$ of any two instances and their instance representations $s_1$ and $s_2$, define the similarity measure between $s_1$ and $s_2$ as

(16)

The message passing between same-class instances and that between different-class instances are defined, respectively, as

(17)
(18)

If $x_1$ and $x_2$ belong to the same class, then

(19)

If $x_1$ and $x_2$ belong to different classes, then

(20)

This result shows that, compared with the original similarity in Eq. (16), if the similarity becomes smaller after message passing, then the mechanism transfers different information; otherwise, it transfers similar information.

4 Experiments

Encoder Model 5-way-1-shot 5-way-5-shot 10-way-1-shot 10-way-5-shot
CNN Proto-CNN [Snell et al.2017] 72.65 / 74.52 86.15 / 88.40 60.13 / 62.38 76.20 / 80.45
Proto-HATT [Gao et al.2019a] 75.01 / – – 87.09 / 90.12 62.48 / – – 77.50 / 83.05
MLMAN [Ye and Ling2019] 78.85 / 82.98 88.32 / 92.66 67.54 / 73.59 79.44 / 87.29
Bert Proto-Bert [Snell et al.2017] 82.92 / 80.68 91.32 / 89.60 73.24 / 71.48 83.68 / 82.89
MAML [Finn et al.2017] 82.93 / 89.70 86.21 / 93.55 73.20 / 83.17 76.06 / 88.51
GNN [Satorras and Estrach2018] 74.21 / 75.66 86.16 / 89.06 67.98 / 70.08 73.65 / 76.93
BERT-PAIR [Gao et al.2019b] 85.66 / 88.32 89.48 / 93.22 76.84 / 80.63 81.76 / 87.02
REGRAB [Qu et al.2020] 87.93 / 90.30 92.58 / 94.25 80.52 / 84.09 87.02 / 89.93
TD-Proto [Sun et al.2019] 83.43 / 84.53 90.26 / 92.38 72.45 / 74.32 82.10 / 85.19
ConceptFERE [Yang et al.2021b] 87.21 / 89.21 90.53 / 93.98 73.56 / 75.72 83.29 / 86.21
TPN [Wen et al.2021] – – / 80.14 – – / 93.60 – – / 72.67 – – / 89.83
CTEG [Wang et al.2020] 84.72 / 88.11 92.52 / 95.25 76.01 / 81.29 84.89 / 91.33
FAEA(ours)+AMP 90.81 / 95.10 94.24 / 96.48 84.22 / 90.12 88.74 / 92.72
MTB [Soares et al.2019] – – / 93.86 – – / 97.06 – –/89.20 – – / 94.27
CP [Peng et al.2020] – – / 95.10 – – / 97.10 – – / 91.20 – – / 94.70
FAEA+CP 94.11 / 96.36 89.55 / 97.85 86.59 / 93.82 93.64 / 96.29
Table 1: Accuracy (%) of few-shot classification on the FewRel 1.0 validation / test set.
Model 5-way-1-shot 5-way-5-shot 10-way-1-shot 10-way-5-shot
Proto-CNN 35.09 49.37 22.98 35.22
Proto-BERT 40.12 51.50 26.45 36.93
Proto-ADV 42.21 58.71 28.91 44.35
Bert-Pair 67.41 78.57 54.89 66.85
Our 73.58 90.10 62.98 80.51
Table 2: Accuracy (%) of few shot classification on the FewRel2.0 domain adaptation test set.
Model 2-way-1-shot 4-way-1-shot 5-way-1-shot 5-way-3-shot 5-way-5-shot
R / I R / I R / I R / I R / I
Proto-HATT 83.26 53.62 78.61 49.72 75.01 62.13 80.53 68.15 87.09 73.02
Proto-Bert 87.54 54.96 83.44 51.25 82.92 66.23 85.23 67.96 91.32 72.53
Bert-Pair 91.21 56.20 87.44 54.87 85.66 67.53 88.42 69.92 89.48 71.21
TD-Proto 89.69 53.81 85.36 52.31 83.25 63.21 84.21 65.32 85.21 70.19
ConceptFERE 92.57 62.21 88.89 59.76 87.21 69.47 88.79 71.15 90.53 76.24
Our 97.65 78.96 92.21 75.45 90.81 80.02 91.96 82.26 94.24 85.63
Table 3: Accuracy (%) under different few-shot settings on FewRel1.0. ‘R’ stands for the standard (random) few-shot setting and ‘I’ stands for the evaluation setting in which every task includes inverse relations.
Model Id 5-way-1-shot 10-way-1-shot
Our 1 90.81 84.22
– phrase-level encoding 2 84.98 77.02
– function word attn 3 87.52 80.92
– general attn 4 88.81 82.08
– constituent attn 5 88.93 82.69
– related attn 6 89.43 83.51
– message passing 7 90.01 83.62
– mean 8 90.32 83.86
Table 4: Ablation study on FewRel 1.0 validation set showing accuracy (%).

4.1 Baselines

4.2 Datasets and Settings

We evaluate our model on FewRel1.0 [Han et al.2018] and FewRel2.0 [Gao et al.2019b]. FewRel1.0 consists of 100 relations, each with 700 labeled instances. Our experiments follow the splits used in the official benchmarks, which divide the dataset into 64 relations for training, 16 for validation, and 20 for testing.

We evaluate our model in terms of the averaged accuracy on the query set over multiple N-way-K-shot tasks. Following [Gao et al.2019a, Gao et al.2019b], we choose $N$ to be 5 and 10, and $K$ to be 1 and 5, forming four scenarios. In addition, we take uncased BERT-base, with 768-dimensional hidden states, as the encoder for fair comparison. The input max length is set to 128. Besides, the AdamW optimizer is applied with weight decay. Furthermore, the hyper-parameter $\lambda$ is set to a fixed value and the memory unit $m$ is randomly initialized following [Sun et al.2019].
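For readers unfamiliar with the episodic protocol, the sketch below shows how an N-way-K-shot task can be sampled from FewRel-style data; the data structure and function names are assumptions for illustration only.

    # Sketch of N-way-K-shot episode sampling (illustrative, not official code).
    import random

    def sample_episode(data_by_relation, n_way=5, k_shot=1, q_query=1):
        relations = random.sample(list(data_by_relation), n_way)
        support, query = [], []
        for label, rel in enumerate(relations):
            picked = random.sample(data_by_relation[rel], k_shot + q_query)
            support += [(inst, label) for inst in picked[:k_shot]]
            query += [(inst, label) for inst in picked[k_shot:]]
        return relations, support, query

    # toy usage with dummy instances (64 training relations, 700 instances each)
    data = {f"relation_{i}": [f"sentence_{i}_{j}" for j in range(700)] for i in range(64)}
    relations, support, query = sample_episode(data, n_way=5, k_shot=1)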

4.3 Results

Figure 3: A 2-way-1-shot inverse relation task example. The upper part visualizes the attention score of each word under FAEA and TD-Proto, the middle part shows the similarities between the query instance and the support instances, and the lower part shows the attention scores of the words in the query under different models.
Figure 4: The intra-class and inter-class similarities between some classes, computed by dot product.
Figure 5: The upper part shows t-SNE plots of instance embeddings of ‘notable work’, ‘has part’ and ‘part of’, with or without the mean. The lower part shows the attention scores of the words in a ‘notable work’ instance under different models.

Performance on FSRC

As shown in the upper part of Table 1, our method outperforms the strong baseline models by a large margin, especially in 1-shot scenarios, where it improves the accuracy of both the 5-way-1-shot and 10-way-1-shot tasks, demonstrating its superior ability. In addition, our method achieves good performance on FewRel2.0, as shown in Table 2.

  • Proto and GNN, as widely used baselines for few-shot learning, do not perform well on FSRC. Unlike computer vision, where low-level patterns can be shared across tasks, words that are informative for one task may not be relevant for other tasks. These models ignore such variations in local word importance, whereas FAEA leverages phrase-level attention to attend to local features.

  • TD-Proto and ConceptFERE also use semantic-level attention to explore content words, but they neglect the function words that maintain syntactic structure differences. Since FAEA captures function words to form fine-grained features, it obtains better performance.

  • When computing relation prototypes, Proto-HATT and TPN utilize intra-class commonalities without considering inter-class differences. FAEA captures and leverages these differences to obtain more discriminative representations.

Performance on FSIRC

To further illustrate the model's effectiveness for FSIRC, we evaluate models on the FewRel1.0 validation set with different settings, as shown in Table 3. Random is the general evaluation setting, which samples 10,000 test tasks randomly from the validation relations. Inverse means that each evaluated task includes inverse relations. As we can see, the baselines achieve good accuracy under the random setting but drop significantly under the inverse setting, by around 26.98 points in 1-shot scenarios, which illustrates that FSIRC tasks are extremely challenging. FAEA gains the best accuracy, especially under the inverse setting, proving that it can effectively capture function words and handle FSIRC tasks.
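The Inverse setting can be reproduced with a sampler that forces each task to contain an inverse relation pair, as sketched below; the list of inverse pairs and the remaining details are assumptions, since the paper only states that every evaluated task includes inverse relations.

    # Sketch of task sampling under the inverse evaluation setting.
    import random

    def sample_inverse_episode(data_by_relation, inverse_pairs, n_way=5, k_shot=1):
        pair = random.choice(inverse_pairs)                 # guarantee one inverse pair
        rest = [r for r in data_by_relation if r not in pair]
        relations = list(pair) + random.sample(rest, n_way - 2)
        support = {r: random.sample(data_by_relation[r], k_shot) for r in relations}
        return relations, support

    inverse_pairs = [("participant", "participant of"), ("has part", "part of")]
    data = {r: [f"{r} sentence {j}" for j in range(700)]
            for r in ["participant", "participant of", "has part", "part of",
                      "country", "religion", "director"]}
    relations, support = sample_inverse_episode(data, inverse_pairs, n_way=5)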

5 Analysis

5.1 Analysis of function words attention

This section discusses the effect of function-word attention. As shown in Table 4, removing phrase-level encoding (Model 2) or function-word attention (Model 3) severely decreases performance, indicating that function words are also essential for representing relations. Furthermore, as shown in Figure 3, with the help of function-word attention, our model highlights ‘are’ and ‘of’ to form the phrase “are part of”, which appears in both the query and the support instance of class “part of”; this support instance therefore gets a higher similarity, and our model correctly classifies the query.

To demonstrate the effectiveness of the three components of function-word attention, we can see from Models 4, 5 and 6 of Table 4 that there is a performance decline when each component is removed separately. As shown in the lower part of Figure 3, TD-Proto mainly attends to content words such as ‘parts’ and ‘Lake’. FAEA without general attention enhances not only function-word importance but also the importance of content words unrelated to keywords, such as ‘mountains’ and ‘Hakkodda’. FAEA without constituent attention enhances some keyword-unrelated function words such as ‘along’ and ‘and’. FAEA without related attention tends to decrease the importance of related function words far away from the keywords, such as ‘are’. The full FAEA captures the correct function words to form “are three parts of”, which demonstrates that all three components contribute to enhancing function-word importance.

5.2 Analysis of message passing

As shown in Table 4, we compare the model without the message passing mechanism (Model 7) and message passing without the mean (Model 8). We observe that considering message passing achieves higher accuracy, and adding the mean to control the proportion of the transferred inter-class message further improves performance.

To further demonstrate the effectiveness of message passing, we choose some classes and visualize their similarities in Figure 4. We can see that, when only content words are considered, the inter-class information of inverse relations has a high similarity score, and the introduction of function words can effectively reduce it. However, from the left part we also find that function words reduce the similarity of intra-class information. The designed message passing mechanism without the mean can effectively increase intra-class commonalities and keep inter-class differences. But from the right part, we observe that when the inter-class differences are already large, the message passing mechanism makes the inter-class similarity decline sharply, which destroys the original relation semantics.

Specifically, from the upper part of Figure 5, we can see that instances of “part of” and “notable work” are quite different, with a similarity of 0.79, much lower than the score of 0.89 between “has part” and “part of”. After message passing without the mean, their similarity sharply decreases to 0.38, and from the lower part of Figure 5 we can see that the instance of “notable work” attends to some distinct but class-irrelevant words, such as ‘crime’ and ‘by’. After message passing with the mean, the similarity declines slowly to 0.65, and the model enhances the class-related word ‘novel’ while avoiding the introduction of noise. All results show that the best performance is achieved by message passing with the mean.

6 Conclusion

In this paper, we have presented FAEA, a framework that can effectively handle few-shot inverse relations by enhancing the importance of related function words. Experiments demonstrate that FAEA achieves new state-of-the-art results on two FewRel benchmarks. In future work, we will try to design a more effective and general function-words enhanced backbone network for various NLP tasks.

References

  • [Bao et al.2020] Yujia Bao, Menghua Wu, Shiyu Chang, and Regina Barzilay. Few-shot text classification with distributional signatures. In ICLR, Addis Ababa, Ethiopia, April 2020. OpenReview.net.
  • [Devlin et al.2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, Minneapolis, MN, USA, June 2019. Association for Computational Linguistics.
  • [Finn et al.2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, Sydney, NSW, Australia, August 2017. PMLR.
  • [Gao et al.2019a] Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In AAAI, pages 6407–6414, Honolulu, Hawaii, USA, January–February 2019. AAAI Press.
  • [Gao et al.2019b] Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Fewrel 2.0: Towards more challenging few-shot relation classification. In EMNLP-IJCNLP, pages 6249–6254, Hong Kong, China, November 2019. Association for Computational Linguistics.
  • [Han et al.2018] Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In EMNLP, pages 4803–4809, Brussels, Belgium, October–November 2018. Association for Computational Linguistics.
  • [Ma et al.2021] Kaixin Ma, Filip Ilievski, Jonathan Francis, Yonatan Bisk, Eric Nyberg, and Alessandro Oltramari. Knowledge-driven data construction for zero-shot evaluation in commonsense question answering. In AAAI, pages 13507–13515, Online Event, February 2021. AAAI Press.
  • [Peng et al.2020] Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Learning from context or names? an empirical study on neural relation extraction. In EMNLP, pages 3661–3672, Online, November 2020. Association for Computational Linguistics.
  • [Qu et al.2020] Meng Qu, Tianyu Gao, Louis-Pascal A. C. Xhonneux, and Jian Tang. Few-shot relation extraction via bayesian meta-learning on relation graphs. In ICML, pages 7867–7876, Virtual Event, July 2020. PMLR.
  • [Satorras and Estrach2018] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. In ICLR, Vancouver, BC, Canada, April–May 2018. OpenReview.net.
  • [Snell et al.2017] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, Long Beach, CA, USA, December 2017. Advances in Neural Information Processing Systems.
  • [Soares et al.2019] Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching the blanks: Distributional similarity for relation learning. In ACL, pages 2895–2905, Florence, Italy, July–August 2019. Association for Computational Linguistics.
  • [Sun et al.2019] Shengli Sun, Qingfeng Sun, Kevin Zhou, and Tengchao Lv. Hierarchical attention prototypical networks for few-shot text classification. In EMNLP-IJCNLP, pages 476–485, Hong Kong, China, November 2019. Association for Computational Linguistics.
  • [Wang et al.2020] Yuxia Wang, Karin Verspoor, and Timothy Baldwin. Learning from unlabelled data for clinical semantic textual similarity. In ClinicalNLP, pages 227–233, Online, November 2020. Association for Computational Linguistics.
  • [Wen et al.2021] Wen Wen, Yongbin Liu, Chunping Ouyang, Qiang Lin, and Tong Lee Chung. Enhanced prototypical network for few-shot relation extraction. Inf. Process. Manag., 58(4):102596, April 2021.
  • [Wu et al.2021] Aming Wu, Yahong Han, Linchao Zhu, and Yi Yang. Universal-prototype enhancing for few-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9567–9576, 2021.
  • [Yang et al.2021a] Fengyuan Yang, Ruiping Wang, and Xilin Chen. SEGA: semantic guided attention on visual prototype for few-shot learning. CoRR, abs/2111.04316, November 2021.
  • [Yang et al.2021b] Shan Yang, Yongfei Zhang, Guanglin Niu, Qinghua Zhao, and Shiliang Pu. Entity concept-enhanced few-shot relation extraction. In ACL/IJCNLP, pages 987–991, Virtual Event, August 2021. Association for Computational Linguistics.
  • [Ye and Ling2019] Zhixiu Ye and Zhenhua Ling. Multi-level matching and aggregation network for few-shot relation classification. In ACL, pages 2872–2881, Florence, Italy, July–August 2019. Association for Computational Linguistics.
  • [Zhang et al.2020] Ji Zhang, Chengyao Chen, Pengfei Liu, Chao He, and Cane Wing-Ki Leung. Target-guided structured attention network for target-dependent sentiment analysis. Trans. Assoc. Comput. Linguistics, 8:172–182, 2020.