Ensemble Making Few-Shot Learning Stronger

05/12/2021
by Qing Lin, et al.

Few-shot learning has rapidly emerged as a viable means for completing various tasks, and many few-shot models have been widely used for relation learning. However, each of these models falls short in capturing certain aspects of semantic features; for example, CNNs miss long-range dependencies, while Transformers miss local features. It is difficult for a single model to adapt to diverse relation learning tasks, which results in a high-variance problem. An ensemble strategy can be competitive for improving the accuracy of few-shot relation extraction and mitigating the risk of high variance. This paper explores an ensemble approach to reduce the variance and introduces fine-tuning and feature attention strategies to calibrate relation-level features. Results on several few-shot relation learning tasks show that our model significantly outperforms the previous state-of-the-art models.


1 Introduction

Relation ID  CNN (Ed)  Incp (Ed)  GRU (Ed)  Trans (Ed)  CNN (Cosine)  Incp (Cosine)  GRU (Cosine)  Trans (Cosine)  Ensemble
P4552 86.96 88.47 91.29 88.80 88.56 90.21 87.57 87.22 91.87
P706 81.91 82.38 81.95 79.01 79.17 75.97 77.42 78.99 82.28
P176 93.90 94.33 93.40 93.32 92.84 95.09 93.94 92.42 96.12
P102 96.15 97.02 96.72 95.45 94.17 96.32 94.54 95.89 97.75
P674 89.14 89.85 90.54 88.15 89.57 91.71 90.91 87.83 92.33
P101 88.30 88.95 86.88 87.17 84.13 85.00 82.19 84.36 90.17
P2094 97.69 98.16 97.99 98.96 97.87 98.34 97.03 99.30 99.21
P413 96.91 96.95 97.75 96.76 94.74 96.68 95.22 93.05 98.18
P25 97.10 97.60 96.97 96.09 96.22 96.58 95.31 96.92 98.08
P921 82.56 81.78 76.03 80.34 80.91 80.43 76.44 78.11 83.54
Table 1: Accuracy on each relation of the test data in the 5-way 5-shot relation classification task using prototypical networks. CNN denotes the convolutional neural network encoder (Lecun et al., 1989), Incp the inception network encoder (Szegedy et al., 2015), GRU the gated recurrent unit encoder (Cho et al., 2014), and Trans the Transformer encoder (Vaswani et al., 2017); Ensemble is our ensemble approach. Ed is Euclidean distance and Cosine is cosine distance.

Few-shot learning can reduce the burden of annotating data and quickly generalize to new tasks without training from scratch. It has become an approach of choice in many natural language processing tasks such as entity recognition and relation classification. Many few-shot models have been proposed for relation extraction, such as the siamese neural network (Koch et al., 2015), the matching network (Vinyals et al., 2016), the relation network (Sung et al., 2018), and the prototypical network (Snell et al., 2017). Among these models, the prototypical network is a simple and efficient approach.

However, representation learning and metric selection for relation classification are challenging due to the rich forms of relation expressions in natural language, which usually involve both local and global, complicated correlations between entities. These are also the leading causes of the high-variance problem (Dhillon et al., 2020). It is difficult for a single model to learn a good representation for every relation. This paper aims to learn robust relation representations and similarity metrics for few-shot relation learning.

We propose an ensemble learning approach. We integrate several accurate and diverse neural networks to learn feature representations from each statement's semantics, rather than relying on a single network. Table 1 reports how four improved prototypical models, which use different neural networks to learn relation and prototype representations, perform on ten relation types under the Euclidean and cosine distance metrics, respectively. The ensemble method performs best on almost all relations: a collection of diverse representations often serves better together than a single strong one.

To further improve domain adaptation and address prototype feature sparsity, we explore fine-tuning and feature attention strategies to calibrate the prototypical representations. To adapt to new relations, we make weight updates using the support samples. Fine-tuning was first proposed in the image classification field (Dhillon et al., 2020; Chen et al., 2020; Wei-Yu et al., 2019). In our fine-tuning strategy, we use the cross-entropy loss to adjust the weights trained from scratch on an annotated corpus. This strategy significantly improves domain adaptation accuracy, especially in cross-domain few-shot relation extraction. To better learn the prototypical representation of each relation, we further propose feature attention to alleviate prototype feature sparsity. This attention mechanism improves both classification performance and convergence speed.

We conduct experiments on the FewRel 1.0 dataset (Han et al., 2018), which is built from Wikipedia (without domain adaptation). To evaluate the effect of applying the trained model to other domains, i.e., when the test data domain differs from the training domain (with domain adaptation), we use the PubMed test set, which comes from a database of biomedical literature and is annotated in FewRel 2.0 (Gao et al., 2019b). Experimental results demonstrate that our ensemble prototypical network significantly outperforms other baseline methods.

2 Related work

Figure 1: Architecture of our proposed ensemble model.

In this section, we discuss the related work on few-shot learning.

Parameter Optimization Learning: In 2017, meta networks based on gradient optimization were proposed, which aim to utilize learned knowledge and rapidly generalize to new tasks (Munkhdalai and Yu, 2017; Ravi and Larochelle, 2017). A meta network relies on a base learner and a meta learner. The base learner gains meta-information, which includes information about the input task, during dynamic representation learning and then adaptively updates the parameters for the meta learner, while the meta learner memorizes the parameters and acquires knowledge across different tasks (Vanschoren, 2018; Elsken et al., 2020). A model-agnostic method was proposed by Finn et al. (2017): the idea of MAML (Model-Agnostic Meta-Learning) is to learn an initial condition (a set of initialization parameters) that is good for fine-tuning on few-shot problems. The few-shot optimization approach (Ravi and Larochelle, 2017) goes further, relying not only on a good initial condition but also on an LSTM-based optimizer to help fine-tuning. Bayesian Model-Agnostic Meta-Learning (Yoon et al., 2018; Qu et al., 2020) then combines scalable gradient-based meta-learning with nonparametric variational inference in a principled probabilistic framework.

Metric-Based Few-Shot Learning: The siamese neural network was applied to few-shot classification by Koch et al. (2015); it uses a convolutional architecture to rank the similarity between inputs. The matching network (Vinyals et al., 2016) was then proposed; it uses external memories to enhance the neural network, adds an attention mechanism, and adopts cosine distance as the similarity metric to predict relations. The MLMAN model (Ye and Ling, 2019) further improves the matching network. In 2018, Sung et al. (2018) proposed a relation network for few-shot learning, which learns an embedding and a deep non-linear distance metric for comparing query and sample items. Moreover, the Euclidean distance empirically outperforms the more commonly used cosine similarity on multiple tasks. Building on this, a simpler and more efficient model, the prototypical network, was proposed by Snell et al. (2017); this approach uses the standard Euclidean distance as its distance function. In 2019, Gao et al. (2019a) introduced a hybrid attention-based prototypical network, a more efficient variant that trains a weight matrix for the Euclidean distance. These models depend on CNNs, RNNs, or Transformers (Vaswani et al., 2017) as feature extractors, and a single network always has some limitations in acquiring semantic features.

Fine-tuning Methods: The idea of fine-tuning in few-shot learning comes from pre-trained models (Devlin et al., 2018a; Radford, 2018; Peters et al., 2018). A fine-tuned deep network is a strong baseline for few-shot learning (Chen et al., 2020; Wei-Yu et al., 2019); these works connect the softmax cross-entropy loss with cosine distance. In 2020, Dhillon et al. (2020) introduced a transductive fine-tuning baseline for few-shot learning. Most of these works were developed in the image domain and are not widely used for relation extraction. Unlike images, text is more diverse and complicated, so we adapt these techniques to few-shot relation learning tasks.

In this paper, we discuss several factors that affect the robustness of few-shot relation learning. We propose a novel ensemble few-shot learning model that integrates four networks and two metrics into a prototypical network. Furthermore, our model adopts fine-tuning to improve domain adaptation and a feature attention strategy to address feature sparsity.

3 Our Approach

In this section, we give a detailed introduction to the implementation of our ensemble few-shot model, as shown in Figure 1.

3.1 Notations and Definitions

We follow Gao et al. (2019a) and Snell et al. (2017) to define our few-shot setting. Few-shot relation classification (RC) is the task of predicting the relation between the entity pair $(e_1, e_2)$ mentioned in a query instance $x$. Given a relation $r$ and a small support set $S_r = \{(x_1, y_1), \dots, (x_K, y_K)\}$ of labeled examples, each $x_i$ is the $D$-dimensional feature vector of an example and $y_i$ is the corresponding label.

The prototypical network (Snell et al., 2017) assumes that for each class there exists a prototype around which its points cluster, and a query point is classified by computing its distance to the nearest class prototype, which is determined by the instances in the class's support set. Given an instance $x = \{w_1, w_2, \dots, w_n\}$ mentioning two entities, we encode the instance into a low-dimensional embedding $f_\phi(x)$ through an embedding function $f_\phi$ with learnable parameters $\phi$, which differ across the neural network encoder layers. In our ensemble model, we adopt four classical neural networks to learn the $f_\phi$ respectively. The main idea of prototypical networks is to compute a class representation, named the prototype, for each relation $r$:

$c_r = \frac{1}{|S_r|} \sum_{(x_i, y_i) \in S_r} f_\phi(x_i)$  (1)

Given a test point $x$, we compute a distribution over classes as follows:

$p_\phi(y = r \mid x) = \frac{\exp\left(-d\left(f_\phi(x), c_r\right)\right)}{\sum_{r'} \exp\left(-d\left(f_\phi(x), c_{r'}\right)\right)}$  (2)

where $d(\cdot, \cdot)$ is the distance function, which can be either the Euclidean distance or the cosine distance; our model takes both metrics into account. We also adopt feature-level attention to compute $d(\cdot, \cdot)$, which achieves better accuracy and convergence speed.
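To make the classification rule concrete, the following PyTorch-style sketch scores a query against the class prototypes under either metric. It is a minimal sketch under assumed tensor shapes and function names (prototypes, class_distribution), not the authors' implementation.

```python
# Minimal sketch of Eqs. (1)-(2): prototype computation and the softmax over
# negative distances. Shapes and names are assumptions for illustration.
import torch
import torch.nn.functional as F

def prototypes(support_emb):
    # support_emb: [N_way, K_shot, D] encoded support instances
    return support_emb.mean(dim=1)                            # Eq. (1): average per relation

def class_distribution(query_emb, protos, metric="euclidean"):
    # query_emb: [Q, D]; protos: [N_way, D] -> [Q, N_way] probabilities (Eq. (2))
    if metric == "euclidean":
        dist = torch.cdist(query_emb, protos, p=2) ** 2       # squared Euclidean distance
    else:
        dist = 1.0 - F.cosine_similarity(query_emb.unsqueeze(1),
                                         protos.unsqueeze(0), dim=-1)  # cosine distance
    return F.softmax(-dist, dim=-1)                           # nearer prototype -> higher probability
```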

3.2 Feature Attention

The original model uses the simple Euclidean distance as its distance function. In fact, some dimensions are more discriminative for classifying relations in the feature space (Gao et al., 2019a). We improve on the feature-level attention of Gao et al. (2019a) and propose a feature attention based on vector subtraction that identifies the more discriminative dimensions.

(3)

where $K$ is the number of samples in the support set.

(4)

where $\lambda$ is a hyperparameter.

(5)
(6)

where the score vector for each relation is computed via (3) and (4); Equation (5) gives the attended Euclidean distance and Equation (6) the attended cosine distance.
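Because Equations (3)-(6) are not reproduced above, the sketch below gives one plausible reading of the vector-subtraction feature attention: per-dimension scores derived from how tightly the support embeddings cluster around their prototype, used to reweight both distance metrics. The score computation and the hyperparameter name lam are assumptions, not the paper's exact formulas.

```python
# Hedged sketch of dimension-wise feature attention for the two metrics.
# Low within-relation spread -> high weight is an assumed reading of Eqs. (3)-(4).
import torch
import torch.nn.functional as F

def feature_scores(support_emb, lam=0.1):
    # support_emb: [K_shot, D] for one relation -> [D] attention scores
    diff = support_emb - support_emb.mean(dim=0, keepdim=True)   # vector subtraction from the prototype
    spread = diff.abs().mean(dim=0)                              # per-dimension spread in the support set
    return F.softmax(-spread / lam, dim=-1)                      # lam plays the role of the hyperparameter

def attended_euclidean(q, proto, z):
    return (z * (q - proto) ** 2).sum(dim=-1)                    # Eq. (5)-style weighted Euclidean distance

def attended_cosine(q, proto, z):
    num = (z * q * proto).sum(dim=-1)
    den = (z * q * q).sum(dim=-1).sqrt() * (z * proto * proto).sum(dim=-1).sqrt()
    return 1.0 - num / den.clamp_min(1e-8)                       # Eq. (6)-style weighted cosine distance
```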

3.3 Learning Ensemble of Deep Networks

To reduce the high variance of few-shot learning, we use ensemble methods to acquire semantic features, as shown in Figure 1. We now discuss the objective functions for learning the ensemble of prototype models. During meta-training, each network minimizes a cross-entropy loss over the training dataset:

$\mathcal{L}_m(\theta_m) = -\sum_{(x, y)} \log p_{\theta_m}(y \mid x) + \lambda_{wd} \lVert \theta_m \rVert_2^2$  (7)

where $\theta_m$ are the parameters of one basic neural network, which can be a CNN, Inception, GRU, or Transformer in our experiments. The cost is the cross-entropy between the label $y$ and the prediction $p_{\theta_m}(y \mid x)$, and $\lambda_{wd}$ is the weight decay parameter.

In our ensemble model, $M$ is the number of ensemble networks. When training the networks independently, each member $m = 1, \dots, M$ optimizes (7) separately. Because the chosen basic networks are very different, the semantic features each network learns typically differ as well. This diversity is what makes the ensemble model appealing in our paper.

In our ensemble model, we propose a joint loss function that couples the individual networks:

(8)

where the entropy term is computed on the ensemble output; we aim to seek outputs with a peaked posterior. This joint loss encourages the ensemble networks to cooperate during training.
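A hedged sketch of this objective follows: each member's cross-entropy plus an entropy penalty on the averaged ensemble posterior, so that minimizing the loss favors a peaked posterior. The uniform averaging and the coefficient entropy_coef are assumptions, since the exact form of (8) is not shown here, and the weight decay from (7) is assumed to be handled by the optimizer.

```python
# Sketch of a joint ensemble objective in the spirit of Eqs. (7)-(8):
# per-member cross-entropy plus an entropy penalty on the ensemble posterior.
import torch
import torch.nn.functional as F

def ensemble_loss(member_logits, labels, entropy_coef=0.1):
    # member_logits: list of [batch, n_way] logits, one tensor per member network
    ce = sum(F.cross_entropy(logits, labels) for logits in member_logits)      # Eq. (7) per member
    probs = torch.stack([F.softmax(l, dim=-1) for l in member_logits]).mean(dim=0)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()        # entropy of ensemble posterior
    return ce + entropy_coef * entropy                                         # minimizing favors a peaked posterior
```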

3.4 Fine-tuning Learning

To recognize novel relation classes, especially in the cross-domain setting, we adopt a fine-tuning strategy to improve domain adaptation. We use the support set $S_r$ to fine-tune our model parameters, where $K$ is the number of available samples of class $r$. We encode each sample of the support set into a representation $f_\phi(x)$ and feed it to a feed-forward classifier (softmax layer):

$p(y \mid x) = \mathrm{softmax}\left(W f_\phi(x) + b\right)$  (9)

where $W$ and $b$ denote the learnable parameters of the feed-forward layer, and we train it with the cross-entropy loss:

$\mathcal{L}_{ft} = -\sum_{(x_i, y_i) \in S_r} \log p(y_i \mid x_i)$  (10)

Using (10), we optimize the parameters of our model. By initializing the parameter $W$ appropriately, it is possible to achieve desirable properties of the ensemble (Dhillon et al., 2020). We therefore use each class prototype $c_r$ from the support set to initialize $W$, setting $W_r = c_r$.
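The sketch below illustrates this step under the settings of Table 3: a softmax classifier whose weight matrix is initialized from the class prototypes and then updated by cross-entropy on the support set. The function and argument names are assumptions, not the authors' code.

```python
# Hedged sketch of the fine-tuning step (Eqs. (9)-(10)): a linear softmax head
# initialized with the class prototypes, updated on the support set with SGD
# using the Table 3 settings (lr=0.1, weight decay 1e-5, 60 steps; all assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fine_tune(encoder, support_x, support_y, protos, steps=60, lr=0.1):
    # protos: [n_way, dim] class prototypes; support_y: [n_support] labels
    n_way, dim = protos.shape
    clf = nn.Linear(dim, n_way)
    with torch.no_grad():
        clf.weight.copy_(protos)                  # W initialized with the prototypes
        clf.bias.zero_()
    params = list(encoder.parameters()) + list(clf.parameters())
    opt = torch.optim.SGD(params, lr=lr, weight_decay=1e-5)
    for _ in range(steps):
        logits = clf(encoder(support_x))          # Eq. (9): feed-forward softmax layer
        loss = F.cross_entropy(logits, support_y) # Eq. (10): cross-entropy on the support set
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder, clf
```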

4 Experiments

In this section, we present the experimental and implementation details of our methods. First, we compare our proposed ensemble model with existing state-of-the-art models at different levels to show its advantages. Second, we study the effect of the different parts that are integrated into our ensemble model. Our ensemble model integrates CNN, Inception, GRU, and Transformer networks, adopting Euclidean and cosine distance respectively, as shown in Figure 1.

4.1 Datasets

For various N-way K-shot tasks, we evaluate our proposed model on two open benchmarks: FewRel 1.0 (Han et al., 2018) and FewRel 2.0 (Gao et al., 2019c), as shown in Table 2.

Dataset Source Split Relation # Instance #
FewRel 1.0 (Han et al., 2018) Wiki Training 60 42,000
  Wiki Validation 10 7,000
  Wiki Testing 10 7,000
FewRel 2.0 (Gao et al., 2019c) Wiki Training 64 44,800
  SemEval-2010 Task 8 Validation 17 8,851
  PubMed Testing 10 2,500
Table 2: Datasets

The FewRel 1.0 dataset has 80 relations with 700 examples each. Because the original test set is hidden, we split the dataset into training, validation, and test sets. To satisfy our experimental setting, we randomly choose 60 relations for training, 10 relations for validation, and the remaining 10 relations for testing; the examples in our test set are disjoint from those in the validation set. However, the training, validation, and test sets all come from the Wikipedia corpus, i.e., they are in the same domain, which is not practical enough. Thus, we also utilize FewRel 2.0, which is cross-domain: its training set comes from Wiki, the same as FewRel 1.0, with 64 relations; its validation set comes from SemEval-2010 Task 8, annotated on a news corpus, with 17 relations; and its test set is PubMed, drawn from a database of biomedical literature, with 10 relations. We use accuracy as the evaluation metric.

4.2 Experimental Setup

Our model hyper-parameters are shown in Table 3. We randomly select 20 samples from the dataset as the query set for each training episode and use 50-dimensional GloVe word vectors as our initial word embeddings. Our model uses SGD with a weight decay of 10^-5 as the optimizer during training and fine-tuning. We perform fine-tuning for 60 epochs without any additional regularization and update the weights with the cross-entropy loss on the support samples.

Batch Size 4
Query Size 20
Training Iterations 30000
Val Step 2000
Learning Rate 0.1
Weight Decay 10^-5
Optimizer SGD
0.1
Fine-tune iterations 60
Table 3: Hyper-parameters Setting

4.3 Baselines

The Siamese network (Koch et al., 2015) maps two samples into the same vector space using two subnetworks and then calculates the distance between them with a distance function. The FSL graph neural network (GNN) (Garcia and Bruna, 2017) maps all support and query instances to vertices in a graph and uses a graph neural network to classify. The prototypical network (Proto) (Snell et al., 2017) classifies by measuring the distance between a query and all prototypes and selecting the label of the nearest prototype. SNAIL (Mishra et al., 2017) is a meta-learning model that formalizes meta-learning as a sequence-to-sequence problem, using a combination of temporal convolutions (TC) and attention. The hybrid attention-based prototypical network (Proto_hatt) (Gao et al., 2019a) is a variant of the prototypical network that consists of an instance-level attention module and a feature-level attention module. These are current state-of-the-art few-shot models.

Cross-domain 5 way 10 way
(FewRel 2.0) 1 shot 5 shot 10 shot 1 shot 5 shot 10 shot
Siamese 39.66 47.72 53.08 27.47 33.58 38.84
GNN 35.95 46.57 52.20 22.73 29.74 -
Proto 40.16 52.62 58.69 28.39 39.38 44.98
Proto_hatt 40.78 56.81 63.72 29.26 43.18 50.36
Bert-pair 56.25* 67.44* - 43.64* 53.17* -
Proto_atten 41.55 55.87 62.28 29.68 42.34 48.63
Ensemble_cosine 42.74 59.12 65.89 30.93 45.51 52.39
Ensemble_edis 44.04 61.41 68.07 32.13 47.88 54.70
Ensemble 44.42 61.76 68.49 32.44 48.26 55.23
Ensemble_fine-tuning / 68.32 74.96 / 63.98 70.61
Table 4: Results on the cross-domain setting (FewRel 2.0). * Results reported by Gao et al. (2019c). / indicates that one-shot tasks are not suitable for our fine-tuning method.
In-domain 5 way 10 way
(FewRel 1.0) 1 shot 5 shot 10 shot 1 shot 5 shot 10 shot
Siamese 75.76 85.80 89.04 64.58 77.42 80.30
GNN 71.18 85.71 89.25 56.01 74.33 -
Snail 72.69 84.22 85.23 58.15 68.36 73.36
Proto_hatt 75.45 89.97 92.03 62.64 82.29 85.74
Proto 74.01 89.46 91.55 61.30 81.66 84.87
Ensemble_edis 79.70 92.55 94.10 69.19 86.61 89.02
Ensemble_cosine 81.40 92.56 93.98 71.22 86.72 88.90
Ensemble 81.35 92.90 94.32 71.29 87.30 89.46
Table 5: Results on the in-domain setting (FewRel 1.0). - indicates out of memory on our GPU.

4.4 Experimental Results and Discussion

In this part, we compare our proposed model with typical few-shot learning models under the same hyper-parameters, which are given in Table 3.

First, we show the results on FewRel 2.0, the cross-domain setting, in Table 4. Our ensemble model achieves over a 3% improvement across different scenarios, and the maximum improvement even reaches 20%, compared with the state-of-the-art models that do not use a pre-trained language model.

In this experiment, we utilize the feature attention mechanism to focus our ensemble model on relation-level features. Giving high weights to the discriminative dimensions highlights the commonality of the examples of the same relation, which helps to discriminate similarity. Figure 2 compares the prototypical network with feature attention (Proto_atten) and without it (Proto) on the FewRel 2.0 dataset. In addition, Proto_atten is a stronger classifier that does not increase the number of parameters; it learns features with richer semantics and thereby accelerates convergence.

Figure 2: Comparison between Proto_atten and Proto.
Figure 3: Radar plots on four random splits of the dataset, relating the performance of each model to the number of examples in each relation.

In our model, we also use the fine-tuning strategy on k-shot tasks (k > 1), shown in the last line of Table 4. The strategy helps our model adapt to tasks in a new domain, and its results are even higher than Bert-pair, which entirely depends on pre-trained BERT (Devlin et al., 2018b).

Overall, our ensemble model yields large improvements on cross-domain tasks. Since few-shot methods for cross-domain settings are still immature and all performances remain low, we also perform experiments and compare results on the in-domain tasks. Table 5 gives the overall in-domain results. Ensemble_cosine refers to the ensemble over the cosine metric, Ensemble_edis to the ensemble over the Euclidean metric, and Ensemble to the ensemble over both.

Focusing on the previous methods in Table 5, we find that the prototypical network, the base architecture of Proto (Snell et al., 2017) and Proto_hatt (Gao et al., 2019a), achieves the highest accuracy on 5-shot and 10-shot tasks but not on 1-shot tasks. In contrast, the Siamese network is surprisingly strong on 1-shot tasks compared with the others. To verify this observation, we randomly redistribute the relations among the training, validation, and testing sets three more times and visualize the results in Figure 3.

As Figure 3 shows, all evaluations on the four random splits share a key similarity: the performance of Proto and Proto_hatt drops dramatically when each relation has only a single example, making the relation label hard to distinguish, and their accuracy falls below that of more traditional approaches such as the Siamese network, which is more stable across the comparisons. Analyzing the results, we find that both Proto and Proto_hatt depend on the prototype of each relation to predict the relation label. In the 1-shot task, however, the prototype is equal to the only example of each relation, so the prototype is completely influenced by noise. Our ensemble model solves this problem through the cooperation of its component models on 1-shot tasks. On k-shot (k >= 5) tasks, our model inherits the advantages of the prototypical model, and the other component models of the ensemble help to recognize more diverse relations. Owing to these characteristics, our ensemble model not only improves performance on all tasks but also becomes more stable when the number of examples changes.

Figure 4: Radar plots on the four random splits of the dataset, comparing the stability of each model.
5 way 10 way
Model 1 shot 5 shot 10 shot 1 shot 5 shot 10 shot
Cnn_edis 74.31 90.02 92.32 62.03 82.35 85.88
Cnn_cosine 74.63 89.08 91.16 61.63 80.90 84.08
Incep_edis 75.20 90.50 92.70 62.98 83.01 86.53
incep_cosine 77.34 90.538 92.51 65.70 83.40 86.45
GRU_edis 74.12 89.83 92.01 62.42 82.56 85.83
GRU_cosine 76.69 90.09 91.98 65.44 82.82 85.62
Trans_edis 75.29 90.13 92.14 63.73 82.79 85.73
Trans_cosine 75.80 89.71 91.60 64.08 82.35 85.20
Ensemble_edis 79.70 92.55 94.10 69.19 86.61 89.02
Ensemble_cosine 81.40 92.56 93.98 71.22 86.72 88.90
Ensemble 81.35 92.90 94.32 71.29 87.30 89.46
Table 6: Comparison between metrics with individual encoders

To further demonstrate the stability of our ensemble model, we compare the four experiments horizontally and map their results to the four axes of the radar plots in Figure 4. The closer each edge lies to the corresponding equipotential line (grey), the more stable the model and the lower its variance. Compared with the other competitive models, our ensemble model achieves the highest accuracy in all four experiments. Moreover, the four connected edges of our model lie very close to the equipotential lines, which shows that our ensemble model keeps its advantage regardless of which relations are sampled. In terms of specific values, the fluctuation ratio of our ensemble model decreases by over 0.5%, except relative to the SNAIL network, whose accuracy is poor on all tasks. We also find that our ensemble model is more effective on the splits that are harder to predict. Because the relations are chosen randomly each time, these results show that our method reduces the variance no matter which relations are chosen for the training, validation, or testing set.

Above all, our ensemble model improves performance, reduces variance, and enhances domain adaptation. Our ensemble approach is necessary and effective for improving robustness.

4.5 Ablation Study

In this section, we disassemble our ensemble model to analyze the effectiveness of our approach.

Figure 5: Accuracy of each relation in the testing set. The deeper the color, the higher the accuracy.

Effect of ensembling different metrics: In this experiment, we aim to show the effect of the Euclidean distance and the cosine distance, which are integrated into our ensemble model as the distance metrics between prototypes and query examples. In Table 6, we compare the performance under the GRU and Transformer encoders without feature attention. The results show that the models with cosine distance achieve higher accuracy when each relation has fewer examples, while the Euclidean distance is more suitable for larger numbers of samples.

To study this further, we ensemble the models with Euclidean distance and with cosine distance respectively; the comparison results are also presented in Table 6. The above observation holds for the ensemble models as well. Besides, the cosine distance is more stable when the number of given examples changes.

Effect of different encoders: In this part, we combine each encoder with each metric and run experiments under the 5-way 5-shot setting. We present the per-relation results in Figure 5. Different models are good at predicting different subsets of relations; for example, the inception-Euclidean model is better suited to recognizing the P140, P150, P276, and P921 relations than the transformer-cosine model. Thus, each component of our ensemble model contributes to the final performance.

5 Conclusion

In this paper, we propose ensemble prototypical networks for improving accuracy and robustness. Our ensemble model consists of eight modules built from basic neural networks. We adopt fine-tuning to enhance domain adaptation and introduce feature attention to alleviate feature sparsity. In our experiments, we evaluate our model on FewRel 1.0 and FewRel 2.0 and show that it significantly improves accuracy and robustness, achieving state-of-the-art results. In the future, we will explore more diverse ensemble schemes and adopt more neural encoders to make our model stronger.

References

  • Chen et al. (2020) Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. 2020. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. Computer Science.
  • Devlin et al. (2018a) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Devlin et al. (2018b) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018b. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
  • Dhillon et al. (2020) Guneet S. Dhillon, P. Chaudhari, A. Ravichandran, and Stefano Soatto. 2020. A baseline for few-shot image classification. ArXiv, abs/1909.02729.
  • Elsken et al. (2020) Thomas Elsken, Benedikt Staffler, Jan Hendrik Metzen, and Frank Hutter. 2020. Meta-learning of neural architectures for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12365–12375.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org.
  • Gao et al. (2019a) Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. 2019a. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6407–6414.
  • Gao et al. (2019b) Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019b. Fewrel 2.0: Towards more challenging few-shot relation classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6251–6256.
  • Gao et al. (2019c) Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019c. FewRel 2.0: Towards more challenging few-shot relation classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6250–6255, Hong Kong, China. Association for Computational Linguistics.
  • Garcia and Bruna (2017) Victor Garcia and Joan Bruna. 2017. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
  • Han et al. (2018) Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4803–4809.
  • Koch et al. (2015) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille.
  • Lecun et al. (1989) Y Lecun, B Boser, J Denker, D Henderson, R Howard, W Hubbard, and L Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551.
  • Mishra et al. (2017) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2017. A simple neural attentive meta-learner.
  • Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. 2017. Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2554–2563. JMLR. org.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.
  • Qu et al. (2020) Meng Qu, Tianyu Gao, Louis-Pascal Xhonneux, and Jian Tang. 2020. Few-shot relation extraction via bayesian meta-learning on relation graphs. In International Conference on Machine Learning, pages 7867–7876. PMLR.
  • Radford (2018) A. Radford. 2018. Improving language understanding by generative pre-training.
  • Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in neural information processing systems, pages 4077–4087.
  • Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208.
  • Szegedy et al. (2015) Christian Szegedy, W. Liu, Y. Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, D. Erhan, V. Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9.
  • Vanschoren (2018) Joaquin Vanschoren. 2018. Meta-learning: A survey. arXiv preprint arXiv:1810.03548.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638.
  • Wei-Yu et al. (2019) Chen Wei-Yu, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Wang, and Jia-Bin Huang. 2019. A closer look at few-shot classification. In International Conference on Learning Representations.
  • Ye and Ling (2019) Zhi-Xiu Ye and Zhen-Hua Ling. 2019. Multi-level matching and aggregation network for few-shot relation classification. arXiv preprint arXiv:1906.06678.
  • Yoon et al. (2018) Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. 2018. Bayesian model-agnostic meta-learning. Advances in Neural Information Processing Systems, 31:7332–7342.

Appendices

5.1 Open Relation Extraction Datasets

5.2 Computing Infrastructure

Computing infrastructure: GPU Tesla V100

5.3 Encoders Description

Encoder module

Given a sample $x = \{w_1, w_2, \dots, w_n\}$, represented by a sequence of word embeddings, we use different neural architectures as sentence encoders to obtain a continuous low-dimensional sample embedding $\mathbf{x}$. We denote the encoder operation by the following equation:

$\mathbf{x} = \mathrm{Encoder}_{net}(x)$  (11)

where $net$ denotes the neural network architecture used in the encoder.

CNN encoder: In this module, we use a CNN to encode $x$ into an embedding $\mathbf{x}$. Convolution and pooling operations are applied in turn to capture the text semantics and obtain the sample embedding:

$H = \mathrm{Conv}(x), \quad \mathbf{x} = \mathrm{MaxPool}(H)$  (12)

where the convolution operation slides a convolution kernel over the word embeddings to produce the hidden embeddings $H$, and the pooling operation uses max pooling to output the final sample embedding $\mathbf{x}$. We simplify the above operations to

$\mathbf{x} = \mathrm{Encoder}_{cnn}(x)$  (13)
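A minimal PyTorch sketch of this encoder is given below; the hidden size and kernel width are assumptions.

```python
# Sketch of the CNN sentence encoder (Eqs. (12)-(13)): convolution over the word
# embeddings followed by max pooling. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    def __init__(self, emb_dim=50, hidden_dim=230, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, hidden_dim, kernel, padding=kernel // 2)

    def forward(self, x):
        # x: [batch, seq_len, emb_dim] word embeddings
        h = torch.relu(self.conv(x.transpose(1, 2)))   # convolution over the sequence
        return h.max(dim=-1).values                    # max pooling -> [batch, hidden_dim]
```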

Inception encoder: Referring to GoogLeNet, we design an inception module with wider convolution layers as the encoder, which uses multiple parallel convolution kernels with different window sizes to encode the sample into hidden embeddings:

$h_i = \mathrm{Conv}_i(x)$  (14)

where $\mathrm{Conv}_i$ denotes a convolution with a kernel of window size $i$. To reduce the computational complexity of the convolution with the largest window size, we decompose it into two convolutions with smaller window sizes. The features obtained from the different-scale convolutions are fused to obtain the final hidden embeddings:

$H = [h_1; h_2; \dots; h_k]$  (15)

Here $[\,\cdot\,; \cdot\,]$ denotes concatenating all embeddings $h_i$ into a higher-dimensional embedding $H$. Finally, we obtain the sample embedding by applying a pooling operation to the hidden embeddings:

$\mathbf{x} = \mathrm{MaxPool}(H)$  (16)

We denote the above operations as

$\mathbf{x} = \mathrm{Encoder}_{incep}(x)$  (17)
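A minimal sketch of such an inception-style encoder follows; the branch widths and the factorization of the wide kernel into two 3-wide convolutions are assumptions.

```python
# Sketch of the inception-style encoder (Eqs. (14)-(17)): parallel convolutions
# with different window sizes, concatenated and max-pooled. Sizes are assumptions.
import torch
import torch.nn as nn

class InceptionEncoder(nn.Module):
    def __init__(self, emb_dim=50, branch_dim=128):
        super().__init__()
        self.branch1 = nn.Conv1d(emb_dim, branch_dim, 1)
        self.branch3 = nn.Conv1d(emb_dim, branch_dim, 3, padding=1)
        self.branch5 = nn.Sequential(                      # wide receptive field via two 3-wide kernels
            nn.Conv1d(emb_dim, branch_dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(branch_dim, branch_dim, 3, padding=1))

    def forward(self, x):
        # x: [batch, seq_len, emb_dim] word embeddings
        h = x.transpose(1, 2)                              # [batch, emb_dim, seq_len]
        hidden = torch.cat([self.branch1(h), self.branch3(h), self.branch5(h)], dim=1)
        return torch.relu(hidden).max(dim=-1).values       # concatenate then max pool
```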

Attention-based GRU encoder: In this encoder, we use a bidirectional recurrent neural network with self-attention to process the sample $x$. The encoder consists of a bidirectional GRU layer and a self-attention layer. The GRU layer uses a parameter-shared GRU cell to process the sample and produce hidden embeddings from the current input state and the previous output state:

$\overrightarrow{h}_t = \overrightarrow{\mathrm{GRU}}(w_t, \overrightarrow{h}_{t-1}), \quad \overleftarrow{h}_t = \overleftarrow{\mathrm{GRU}}(w_t, \overleftarrow{h}_{t+1})$  (18)

We concatenate $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ to obtain the hidden embedding $h_t$ and denote all $h_t$ as $H$; the final sample embedding is then a linear combination of the hidden embeddings in $H$. The self-attention layer computes this linear combination: it takes the hidden embeddings in $H$ as input and computes the weight vector $a$:

$a = \mathrm{softmax}\left(W_a\, \sigma(H)\right)$  (19)

where $W_a$ is a linear layer and $\sigma$ is an activation function. The final representation $\mathbf{x}$ of the sample is the weighted sum

$\mathbf{x} = \sum_t a_t h_t$  (20)

We denote the above operations as

$\mathbf{x} = \mathrm{Encoder}_{gru}(x)$  (21)
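A minimal sketch of this encoder follows; the use of tanh as the activation in the scorer and the layer sizes are assumptions.

```python
# Sketch of the attention-based bi-GRU encoder (Eqs. (18)-(21)): a bidirectional
# GRU produces H, a linear scorer yields attention weights a, and the sample
# embedding is the weighted sum of H. Sizes and activation are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveGRUEncoder(nn.Module):
    def __init__(self, emb_dim=50, hidden_dim=115):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):
        # x: [batch, seq_len, emb_dim] word embeddings
        h, _ = self.gru(x)                                  # H: [batch, seq_len, 2*hidden_dim]
        a = F.softmax(self.scorer(torch.tanh(h)), dim=1)    # attention weights over positions
        return (a * h).sum(dim=1)                           # weighted sum -> [batch, 2*hidden_dim]
```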

Attention-based Transformer encoder: The Transformer is a parallel neural network architecture built from attention mechanisms and consists of two parts, an encoder and a decoder. Here we use the Transformer encoder as our sentence encoder:

$H = \mathrm{TransformerEncoder}(x)$  (22)

The Transformer encoder takes the word embeddings of the sample $x$ as input and outputs the hidden embeddings $H$. We then append a self-attention layer, as in the GRU encoder, to compute a linear combination of the hidden embeddings in $H$ and obtain the final sample embedding $\mathbf{x}$:

$\mathbf{x} = \sum_t a_t h_t$  (23)

We denote the above operations as

$\mathbf{x} = \mathrm{Encoder}_{trans}(x)$  (24)
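A minimal sketch of this encoder follows; the number of layers, the number of heads, and the pooling details are assumptions.

```python
# Sketch of the attention-based Transformer encoder (Eqs. (22)-(24)): Transformer
# encoder layers over the word embeddings, followed by the same self-attention
# pooling as in the GRU encoder. Layer counts and widths are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveTransformerEncoder(nn.Module):
    def __init__(self, emb_dim=50, n_heads=5, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=n_heads,
                                           dim_feedforward=4 * emb_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.scorer = nn.Linear(emb_dim, 1)

    def forward(self, x):
        # x: [batch, seq_len, emb_dim] word embeddings
        h = self.encoder(x)                                 # H: [batch, seq_len, emb_dim]
        a = F.softmax(self.scorer(torch.tanh(h)), dim=1)    # attention weights over positions
        return (a * h).sum(dim=1)                           # final sample embedding [batch, emb_dim]
```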

5.4 Other Experimental Results

Experimental results on the cross-domain dataset FewRel 2.0:

5 way 10 way
Model 1 shot 5 shot 10 shot 1 shot 5 shot 10 shot
Cnn_edis 41.54 55.87 62.28 29.69 42.33 48.62
Cnn_cosine 40.57 51.95 57.45 28.56 38.13 43.05
Incep_edis 39.70 55.11 61.81 27.82 41.57 48.10
incep_cosine 39.70 53.69 60.07 27.93 40.14 46.16
GRU_edis 40.41 53.73 58.99 28.28 40.07 45.05
GRU_cosine 39.63 52.55 57.68 27.79 39.01 43.99
Trans_edis 41.15 54.98 60.43 29.14 41.40 46.89
Trans_cosine 40.16 53.61 59.22 28.24 40.19 45.65
Ensemble 44.42 61.76 68.49 32.44 48.26 55.23
Table 7: Accuracy of each sub-model in our ensemble method on the cross-domain dataset FewRel 2.0
5 way 10 way
Model 1 shot 5 shot 10 shot 1 shot 5 shot 10 shot
Ensemble_cosine 42.67 57.99 64.45 30.55 44.06 50.43
Ensemble_edis 43.24 59.70 66.20 31.13 45.95 52.58
Ensemble 44.39 61.27 67.89 32.20 47.61 54.41
Table 8: Results of our ensemble model when voting is used as the ensemble scheme on the cross-domain dataset
5 way 10 way
Model 1 shot 5 shot 10 shot 1 shot 5 shot 10 shot
Cnn_edis / 58.55 66.28 / 48.01 56.05
Cnn_cosine / 61.28 68.94 / 53.17 62.94
Incep_edis / 63.10 70.26 / 57.05 65.52
incep_cosine / 60.02 67.91 / 51.36 60.14
GRU_edis / 57.91 63.03 / 49.53 56.32
GRU_cosine / 56.43 61.90 / 47.97 54.36
Trans_edis / 63.35 69.88 / 57.93 63.91
Trans_cosine / 62.93 68.49 / 57.85 65.15
Ensemble_fine-tune / 68.32 74.96 / 63.98 70.61
Table 9: Accuracy of each sub-model after fine-tuning on the cross-domain dataset FewRel 2.0

Comparative experiment in which the model uses GloVe word vectors and BERT word vectors, respectively, as the initial word embeddings on a random in-domain split of FewRel 1.0. The accuracy of the model fluctuates greatly when BERT word vectors are used, so this is only a preliminary comparison.

Model GloVe Bert
Cnn_edis 89.81 92.78
Cnn_cosine 89.28 91.93
Incep_edis 91.01 93.65
incep_cosine 90.62 92.75
GRU_edis 90.49 91.84
GRU_cosine 89.79 90.63
Trans_edis 90.93 92.06
Trans_cosine 90.62 91.712
Ensemble 93.18 94.71
Bert-pair 94.19
Table 10: Comparison between our models using GloVe and BERT word vectors, respectively, as the initial word embeddings