1 Introduction
Relation ID  CNN (Ed)  Incp (Ed)  GRU (Ed)  Trans (Ed)  CNN (Cosine)  Incp (Cosine)  GRU (Cosine)  Trans (Cosine)  Ensemble
P4552  86.96  88.47  91.29  88.80  88.56  90.21  87.57  87.22  91.87 
P706  81.91  82.38  81.95  79.01  79.17  75.97  77.42  78.99  82.28 
P176  93.90  94.33  93.40  93.32  92.84  95.09  93.94  92.42  96.12 
P102  96.15  97.02  96.72  95.45  94.17  96.32  94.54  95.89  97.75 
P674  89.14  89.85  90.54  88.15  89.57  91.71  90.91  87.83  92.33 
P101  88.30  88.95  86.88  87.17  84.13  85.00  82.19  84.36  90.17 
P2094  97.69  98.16  97.99  98.96  97.87  98.34  97.03  99.30  99.21 
P413  96.91  96.95  97.75  96.76  94.74  96.68  95.22  93.05  98.18 
P25  97.10  97.60  96.97  96.09  96.22  96.58  95.31  96.92  98.08 
P921  82.56  81.78  76.03  80.34  80.91  80.43  76.44  78.11  83.54 
Table 1: Accuracy on each relation of the test data in the 5-way 5-shot relation classification task using prototypical networks. CNN denotes the convolutional neural network (Lecun et al., 1989) encoder, Incp the Inception network (Szegedy et al., 2015) encoder, GRU the gated recurrent unit network (Cho et al., 2014) encoder, and Trans the Transformer network (Vaswani et al., 2017) encoder; Ensemble is our ensemble approach, Ed denotes Euclidean distance, and Cosine denotes cosine distance.

Few-shot learning can reduce the burden of annotated data and quickly generalize to new tasks without training from scratch. It has become an approach of choice for many natural language processing tasks, such as entity recognition and relation classification. Many few-shot models have been proposed for relation extraction, such as the siamese neural network (Koch et al., 2015), the matching network (Vinyals et al., 2016), the relation network (Sung et al., 2018), and the prototypical network (Snell et al., 2017). Among these, the prototypical network is the most efficient and straightforward.

However, learning representations and selecting metrics for relation classification are challenging because relations take rich forms in natural language, usually involving local and global, complicated correlations between entities. These factors are also the leading cause of the high-variance problem (Dhillon et al., 2020). It is therefore difficult for a single model to learn a good representation for every relation. This paper aims to learn robust relation representations and similarity metrics for few-shot relation learning.
We propose an ensemble learning approach: we integrate several accurate and diverse neural networks to learn feature representations from each sentence's semantics, rather than relying on a single network. Table 1 shows how four improved prototypical models, each using a different neural network to learn relation and prototype representations, perform on ten relation types, with either Euclidean distance or cosine distance as the similarity metric. The ensemble method performs best on almost all relations: a collection of diverse representations often serves better than a single strong one.
To further improve domain adaptation and exploit prototype feature sparsity, we explore fine-tuning and feature-attention strategies to calibrate the prototypical representations. To adapt to new relations, we update the weights using support samples. Fine-tuning was first proposed in the image classification field (Dhillon et al., 2020; Chen et al., 2020; WeiYu et al., 2019). In our fine-tuning strategy, we use the cross-entropy loss to adjust the weights trained from scratch on an annotated corpus. This strategy significantly improves domain adaptation, especially in cross-domain few-shot relation extraction. To better learn the prototypical representation of each relation, we further propose feature attention to alleviate the problem of prototype feature sparsity. The attention mechanism enhances both classification performance and convergence speed.
We conduct experiments on the FewRel 1.0 dataset (Han et al., 2018), which is drawn from Wikipedia (without domain adaptation). Then, to show the effect of applying the trained model to other domains, i.e., where the test domain differs from the training domain (with domain adaptation), we use a new test set, PubMed, which comes from a database of biomedical literature and is annotated in FewRel 2.0 (Gao et al., 2019b). Experimental results demonstrate that our ensemble prototypical network significantly outperforms the baseline methods.
2 Related work
In this section, we discuss related work on few-shot learning.
Parameter Optimization Learning: In 2017, meta networks based on gradient optimization were proposed, which aim to utilize learned knowledge and rapidly generalize to new tasks (Munkhdalai and Yu, 2017; Ravi and Larochelle, 2017). A meta network relies entirely on a base learner and a meta learner. The base learner gathers meta-information, including information about the input task, during dynamic representation learning and adaptively updates the parameters for the meta learner, while the meta learner memorizes the parameters and acquires knowledge across different tasks (Vanschoren, 2018; Elsken et al., 2020). The model-agnostic method was proposed by Finn et al. (2017): the idea of MAML (Model-Agnostic Meta-Learning) is to learn an initial condition (a set of initialization parameters) that is good for fine-tuning on few-shot problems. The few-shot optimization approach (Ravi and Larochelle, 2017) goes further: it not only depends on a good initial condition but also utilizes an LSTM-based optimizer to assist fine-tuning. Bayesian Model-Agnostic Meta-Learning (Yoon et al., 2018; Qu et al., 2020) then combines scalable gradient-based meta-learning with non-parametric variational inference in a principled probabilistic framework.
Metric-Based Few-Shot Learning: The siamese neural network was applied to few-shot classification by Koch et al. (2015); it utilizes a convolutional architecture to naturally rank the similarity between inputs. The matching network (Vinyals et al., 2016) was proposed next. It uses external memories to enhance the neural networks and adds an attention mechanism, with cosine distance as the similarity metric, to predict relations. The MLMAN model (Ye and Ling, 2019) further improves the matching network. In 2018, Sung et al. (2018) proposed the relation network for few-shot learning, which learns an embedding and a deep non-linear distance metric for comparing query and sample items. Moreover, Euclidean distance empirically outperforms the more commonly used cosine similarity across multiple tasks. Thus, a simpler and more efficient model, the prototypical network, was proposed by Snell et al. (2017). This naive approach uses standard Euclidean distance as the distance function. In 2019, Gao et al. (2019a) introduced a hybrid attention-based prototypical network, a more efficient prototypical network that trains a weight matrix for the Euclidean distance. These models depend on CNNs, RNNs, or Transformers (Vaswani et al., 2017) as feature extractors, and a single network always faces limitations in acquiring semantic features.

Fine-tuning Methods: The idea of fine-tuning in few-shot learning comes from pre-trained models (Devlin et al., 2018a; Radford, 2018; Peters et al., 2018). A fine-tuned deep network is a strong baseline for few-shot learning (Chen et al., 2020; WeiYu et al., 2019); these works connect the softmax cross-entropy loss with cosine distance. In 2020, Dhillon et al. (2020) introduced a transductive fine-tuning baseline for few-shot learning. Most of these works were developed in the image domain and are not widely used for relation extraction. Unlike images, text is more diverse and complicated, so we adapt these techniques to few-shot relation learning tasks.
In this paper, we discuss several factors that affect the robustness of few-shot relation learning. We propose a novel ensemble few-shot learning model that integrates four networks and two metrics into a prototypical network. Furthermore, our model adopts fine-tuning to improve domain adaptation and a feature-attention strategy to address the problem of feature sparsity.
3 Our Approach
In this section, we give a detailed introduction to the implementation of our ensemble few-shot model, as shown in Figure 1.
3.1 Notations and Definitions
We follow Gao et al. (2019a) and Snell et al. (2017) to define our few-shot setting. Few-shot relation classification (RC) is the task of predicting the relation between the entity pair mentioned in a query instance $x$. Given a relation and a small support set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ of labeled examples, each $x_i$ is the feature vector of an example and $y_i$ is the corresponding label.

The prototypical network (Snell et al., 2017) assumes that for each class there exists a prototype around which its points cluster; a query point is classified by using the distance function to find the nearest class prototype, which is determined by the instances in that class's support set. Given an instance $x$ mentioning two entities, we encode it into a low-dimensional embedding through an embedding function $f_\phi$ with learnable parameters $\phi$, which differ for each neural network encoding layer. In our ensemble model, we adopt four classical neural networks to learn $f_\phi$, respectively. The main idea of prototypical networks is to compute a class representation named the prototype:

$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$  (1)
Given a query point $x$, we compute a distribution over classes as follows:

$p_\phi(y = k \mid x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$  (2)

where $d(\cdot, \cdot)$ is the distance function, which can be either Euclidean distance or cosine distance; our model takes both metrics into account. We also adopt feature-level attention in computing $d$, which achieves better results and faster convergence.
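To make the prototype computation and the distance-based posterior concrete, both steps can be sketched in a few lines of NumPy. This is our own illustration with assumed function names, not the authors' implementation; it covers both metrics used in the paper.

```python
import numpy as np

def prototypes(support, labels, n_classes):
    """Prototype c_k is the mean of the support embeddings of class k."""
    return np.stack([support[labels == k].mean(axis=0) for k in range(n_classes)])

def classify(query, protos, metric="euclidean"):
    """Distribution over classes: softmax over negative distances to each prototype."""
    if metric == "euclidean":
        d = np.linalg.norm(protos - query, axis=1)
    else:  # cosine distance = 1 - cosine similarity
        sim = protos @ query / (np.linalg.norm(protos, axis=1) * np.linalg.norm(query) + 1e-8)
        d = 1.0 - sim
    s = -d
    e = np.exp(s - s.max())  # numerically stable softmax
    return e / e.sum()
```

In a 5-way 5-shot episode, `support` would hold the 25 encoded support instances and `classify` would be called once per query.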
3.2 Feature Attention
The original model uses the simple Euclidean distance as the distance function. In fact, some dimensions are more discriminative for classifying relations in the feature space (Gao et al., 2019a). We improve on the feature-level attention of Gao et al. (2019a) and propose a feature attention based on vector subtraction that can find the more discriminative dimensions.
(3) 
where $K$ is the number of samples in the support set.
(4) 
where $\lambda$ is a hyperparameter.
(5) 
(6) 
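The exact formulas (3)-(6) are garbled in this copy, so the following NumPy sketch shows only one plausible realization of subtraction-based feature attention: dimensions on which the support examples of a class deviate little from their prototype receive higher weight, with $\lambda$ acting as a temperature-style hyperparameter. All names and design choices here are assumptions, not the authors' exact formulation.

```python
import numpy as np

def feature_weights(support_k, lam=0.1):
    """Per-dimension attention for one class: small deviation from the
    prototype (computed by vector subtraction) -> higher weight.
    lam is an assumed temperature-style hyperparameter."""
    proto = support_k.mean(axis=0)                  # class prototype
    dev = np.abs(support_k - proto).mean(axis=0)    # mean deviation per dimension
    score = -dev / lam
    e = np.exp(score - score.max())
    # normalize so that the weights sum to the number of dimensions
    return e / e.sum() * support_k.shape[1]

def weighted_sq_euclidean(query, proto, w):
    """Feature-attended distance used in place of plain Euclidean distance."""
    return float(np.sum(w * (query - proto) ** 2))
```

A noisy dimension is thus down-weighted, so spurious variation in the support set contributes less to the query-prototype distance.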
3.3 Learning Ensemble of Deep Networks
To reduce the high variance of few-shot learning, we use ensemble methods to acquire semantic features, as in Figure 1. We now discuss the objective functions for learning the ensemble-of-prototypes model. During meta-training, each network minimizes the cross-entropy loss over the training dataset:
$\mathcal{L}_m(\theta_m) = -\sum_{(x, y)} \log p_{\theta_m}(y \mid x) + \lambda \lVert \theta_m \rVert^2$  (7)

where $m$ indexes a basic neural network, which can be the CNN, Inception, GRU, or Transformer in our experiments. The cost function is the cross-entropy between the label $y$ and the prediction $p_{\theta_m}(y \mid x)$, and $\lambda$ is a weight-decay parameter.
In our ensemble model, $M$ is the number of ensemble networks. When the networks are trained independently, each of the $M$ networks optimizes (7) separately. Because the chosen base networks are very different, the semantic features each network learns will typically differ. This diversity is what makes the ensemble model appealing.
In our ensemble model, we propose a joint loss function to combine the networks:
(8)  
where the entropy term is computed on the averaged posterior; we seek outputs with a peaked posterior. This loss function helps the ensemble networks collaborate during training.
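One plausible reading of the joint objective in Eq. (8), sketched in NumPy, is the sum of the per-network cross-entropy losses plus an entropy penalty on the averaged posterior, which rewards peaked, agreed-upon predictions. The trade-off weight `beta` and the exact combination are our assumptions.

```python
import numpy as np

def cross_entropy(p, y):
    """Cross-entropy of one predicted distribution against the true label y."""
    return -float(np.log(p[y] + 1e-12))

def entropy(p):
    """Shannon entropy of a distribution; low entropy = peaked posterior."""
    return -float(np.sum(p * np.log(p + 1e-12)))

def ensemble_loss(probs_per_net, y, beta=0.1):
    """Joint loss sketch: per-network cross-entropy plus an entropy penalty
    on the averaged posterior (beta is an assumed trade-off weight)."""
    ce = sum(cross_entropy(p, y) for p in probs_per_net)
    p_bar = np.mean(probs_per_net, axis=0)
    return ce + beta * entropy(p_bar)
```

A confident, correct ensemble thus attains a lower joint loss than one whose members disagree or hedge with near-uniform outputs.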
3.4 Fine-tuning Learning
For our model to recognize novel relation classes, especially across domains, we adopt a fine-tuning strategy to improve domain adaptation. We use the support set $S = \{(x_i, y_i)\}_{i=1}^{K}$, where $K$ is the number of available samples per class, to fine-tune the model parameters. We encode each $x_i$ into a representation and feed it to a feed-forward classifier (softmax layer):
$p(y \mid x) = \mathrm{softmax}(W f_\phi(x) + b)$  (9)
where $W$ and $b$ denote the learnable parameters of the feed-forward layer, and we train it with the cross-entropy loss:
$\mathcal{L}_{ft} = -\sum_{(x_i, y_i) \in S} \log p(y_i \mid x_i)$  (10)
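A minimal sketch of this fine-tuning step, assuming a frozen encoder and fitting only the softmax layer of Eq. (9) with the cross-entropy of Eq. (10) by plain gradient descent. The learning rate and epoch count mirror Table 3, but the initialization and the frozen-encoder choice are assumptions.

```python
import numpy as np

def finetune_softmax(feats, labels, n_classes, lr=0.1, epochs=60):
    """Fit W, b of the feed-forward softmax layer on encoded support
    features by gradient descent on the cross-entropy loss."""
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]               # one-hot targets
    for _ in range(epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)       # softmax probabilities
        grad = (P - Y) / n                      # d(CE)/d(logits)
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

In the cross-domain setting, this classifier head is refit on the K support examples of each novel episode while the meta-trained encoder supplies `feats`.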
4 Experiments
In this section, we present the experimental and implementation details of our methods. First, we compare our proposed ensemble model with existing state-of-the-art models at different levels to show its advantages. Second, we study the effect of the different parts integrated into our ensemble model. Our ensemble model integrates CNN, Inception, GRU, and Transformer networks, adopting Euclidean and cosine distance respectively, as shown in Figure 1.
4.1 Datasets
For various N-way K-shot tasks, we evaluate our proposed model on two open benchmarks: FewRel 1.0 (Han et al., 2018) and FewRel 2.0 (Gao et al., 2019c), as shown in Table 2.
Dataset  Source  Split  Relation #  Instance # 

FewRel 1.0  Wiki  Training  60  42,000 
(Han et al., 2018)  Wiki  Validation  10  7,000 
Wiki  Testing  10  7,000  
FewRel 2.0  Wiki  Training  64  44,800 
(Gao et al., 2019c)  SemEval2010 task 8  Validation  17  8,851 
PubMed  Testing  10  2,500 
The FewRel 1.0 dataset has 80 relations with 700 examples each. Since the original test set is hidden, we split the dataset into training, validation, and test sets. To satisfy our experimental setting, we randomly choose 60 relations for training, 10 for validation, and the remaining 10 for testing; the examples in our test set are disjoint from the validation set. However, the training, validation, and test sets all come from the Wikipedia corpus, i.e., they are in the same domain, which is not realistic enough. Thus, we also use the datasets in FewRel 2.0, which are cross-domain. Its training set comes from Wikipedia, as in FewRel 1.0, and has 64 relations, while the validation set comes from SemEval-2010 Task 8, annotated on a news corpus, with 17 relations; testing is done on PubMed, which comes from a database of biomedical literature and has 10 relations. We use accuracy as the evaluation criterion.
4.2 Experimental Setup
Our model hyperparameters are shown in Table 3. Each time, we randomly select 20 samples from the dataset as the query set for training, and we use GloVe 50-dimensional word vectors as our initial word embeddings. Our model uses SGD with a weight decay of 10^-5 as the optimizer during training and fine-tuning. We fine-tune for 60 epochs without any regularization and update the weights with the cross-entropy loss on support samples.
Batch Size  4 

Query Size  20 
Training Iterations  30000 
Val Step  2000 
Learning Rate  0.1 
Weight Decay  10^-5 
Optimizer  SGD 
0.1  
Finetune iterations  60 
4.3 Baselines
Siamese network (Koch et al., 2015) maps two samples into the same vector space using two sub-networks and then computes the distance between them with a distance function. FSL Graph Neural Networks (GNN) (Garcia and Bruna, 2017) maps all support and query instances to vertices in a graph and classifies with graph neural networks. Prototypical network (Proto) (Snell et al., 2017) classifies by measuring the distance between a query and all prototypes and selecting the label of the nearest prototype. Snail (Mishra et al., 2017) is a meta-learning model that formalizes meta-learning as a sequence-to-sequence problem, using a combination of temporal convolutions (TC) and an attention mechanism. The hybrid attention-based prototypical network (Proto_hatt) (Gao et al., 2019a) is a variant of the prototypical network consisting of an instance-level attention module and a feature-level attention module. These are current state-of-the-art few-shot models.
Crossdomain  5 way  10 way  

(FewRel2.0)  1 shot  5 shot  10 shot  1 shot  5 shot  10shot 
Siamese  39.66  47.72  53.08  27.47  33.58  38.84 
GNN  35.95  46.57  52.20  22.73  29.74   
Proto  40.16  52.62  58.69  28.39  39.38  44.98 
Proto_hatt  40.78  56.81  63.72  29.26  43.18  50.36 
Bertpair  56.25*  67.44*    43.64*  53.17*   
Proto_atten  41.55  55.87  62.28  29.68  42.34  48.63 
Ensemble_cosine  42.74  59.12  65.89  30.93  45.51  52.39 
Ensemble_edis  44.04  61.41  68.07  32.13  47.88  54.70 
Ensemble  44.42  61.76  68.49  32.44  48.26  55.23 
Ensemble_finetuning  /  68.32  74.96  /  63.98  70.61 
Indomain  5 way  10 way  

(FewRel1.0)  1 shot  5 shot  10 shot  1 shot  5 shot  10shot 
Siamese  75.76  85.80  89.04  64.58  77.42  80.30 
GNN  71.18  85.71  89.25  56.01  74.33   
Snail  72.69  84.22  85.23  58.15  68.36  73.36 
Proto_hatt  75.45  89.97  92.03  62.64  82.29  85.74 
Proto  74.01  89.46  91.55  61.30  81.66  84.87 
Ensemble_edis  79.70  92.55  94.10  69.19  86.61  89.02 
Ensemble_cosine  81.40  92.56  93.98  71.22  86.72  88.90 
Ensemble  81.35  92.90  94.32  71.29  87.30  89.46 
4.4 Experimental Results and Discussion
In this part, we compare our proposed model with typical few-shot learning models under the same hyperparameters, which are given in Table 3.
First, we show the results on FewRel 2.0 (cross-domain) in Table 4. Our ensemble model improves by over 3% across the different scenarios, and the maximum improvement even reaches 20% compared with the state-of-the-art models that do not use a pre-trained model.
In this experiment, we utilize a feature-attention mechanism to focus our ensemble model on relation-level features. Giving high weights to the discriminative dimensions highlights the commonality of examples within the same relation, which helps in measuring similarity. Figure 2 compares the prototypical network with feature attention (Proto_atten) and without feature attention (Proto) on the FewRel 2.0 dataset. In addition, Proto_atten is a stronger classifier without increasing the number of parameters: it learns features with richer semantics and thus converges faster.
In our model, we also use the fine-tuning strategy on k-shot tasks (k > 1), shown in the last line of Table 4. This strategy helps our model adapt to tasks in a new domain, and the result is even higher than that of Bert-pair, which depends entirely on the pre-trained BERT (Devlin et al., 2018b).
Overall, our ensemble model yields large improvements on cross-domain tasks, although few-shot methods for cross-domain settings are still immature and all absolute performances remain modest. We also run experiments and compare results on in-domain tasks; Table 5 gives the overall in-domain results. Ensemble_cosine denotes the ensemble with the cosine metric, Ensemble_edis the ensemble with the Euclidean metric, and Ensemble the ensemble with both.
Focusing on the previous methods in Table 5, we find that the prototypical architecture, on which Proto (Snell et al., 2017) and Proto_hatt (Gao et al., 2019a) are based, achieves the highest accuracy on 5-shot and 10-shot tasks but not on 1-shot tasks, where the Siamese network is surprisingly strong compared with the others. To verify this observation, we randomly redistribute the relations among our training, validation, and test sets three times and then visualize the results in Figure 3.
As Figure 3 shows, all evaluations on the four randomly split datasets share a key similarity: the performance of the Proto and Proto_hatt models drops dramatically when each relation has only a single example, making the relation label hard to distinguish, and their accuracy falls below that of more traditional approaches such as the Siamese network, which remains more stable. Analyzing the results, we find that both Proto and Proto_hatt depend on the prototype of each relation to predict the label; in the 1-shot task, however, the prototype equals the only example of the relation, so it is completely influenced by noise. Our ensemble model solves this problem through the cooperation of its component models on 1-shot tasks. On k-shot (k >= 5) tasks, our model inherits the advantage of the prototypical model, and the other component models in the ensemble help recognize more diverse relations. Owing to these characteristics, our ensemble model not only improves performance on all tasks but also becomes more stable as the number of examples changes.
5 way  10 way  
Model  1 shot  5 shot  10 shot  1 shot  5 shot  10shot 
Cnn_edis  74.31  90.02  92.32  62.03  82.35  85.88 
Cnn_cosine  74.63  89.08  91.16  61.63  80.90  84.08 
Incep_edis  75.20  90.50  92.70  62.98  83.01  86.53 
Incep_cosine  77.34  90.53  92.51  65.70  83.40  86.45 
GRU_edis  74.12  89.83  92.01  62.42  82.56  85.83 
GRU_cosine  76.69  90.09  91.98  65.44  82.82  85.62 
Trans_edis  75.29  90.13  92.14  63.73  82.79  85.73 
Trans_cosine  75.80  89.71  91.60  64.08  82.35  85.20 
Ensemble_edis  79.70  92.55  94.10  69.19  86.61  89.02 
Ensemble_cosine  81.40  92.56  93.98  71.22  86.72  88.90 
Ensemble  81.35  92.90  94.32  71.29  87.30  89.46 
To further demonstrate the stability of our ensemble model, we compare the four experiments horizontally and map their results to the four dimensions of the radar plots shown in Figure 4. The closer each edge lies to the corresponding equipotential line (grey), the more stable the model and the lower its variance. Compared with the other competitive models, our ensemble model achieves the highest accuracy in all four experiments. Moreover, the four connected edges of our model lie very close to the equipotential lines, which demonstrates that our ensemble model keeps its advantage regardless of which relations are sampled. In terms of specific values, the fluctuation ratio of our ensemble model decreases by over 0.5%, except relative to the Snail network, whose accuracy is poor on all tasks. We also find that our ensemble model is more effective on the datasets that are hard to predict. Because the relations are chosen randomly each time, these results are sufficient to show that our method reduces variance no matter which relations are chosen for the training, validation, or test sets.
In summary, our ensemble model improves performance, reduces variance, and enhances domain adaptation. Our ensemble approach is necessary and effective for improving robustness.
4.5 Ablation Study
In this section, we disassemble our ensemble model to analyze the effectiveness of our approach.
Effect of ensembling different metrics: In this experiment, we show the effect of Euclidean distance and cosine distance, which are integrated into our ensemble model as the distance metrics between prototypes and query examples. In Table 6, we compare performance under the GRU and Transformer encoders without feature attention. The results demonstrate that the model with cosine distance achieves higher accuracy when each relation has few examples, while Euclidean distance is more suitable when the number of samples is larger.
To study this further, we build ensembles with Euclidean distance and with cosine distance, respectively; the comparison results are also presented in Table 6. They confirm that the above observation holds in the ensemble setting as well. Besides, cosine distance is more stable as the number of given examples changes.
Effect of different encoders: In this part, we combine each encoder with each metric and run experiments under the 5-way 5-shot setting. We present the per-relation results in Figure 5. Different models are good at predicting different subsets of relations; for example, the Inception-Euclidean model is more suitable for recognizing the P140, P150, P276, and P921 relations than the Transformer-cosine model. Thus, each component of our ensemble model contributes substantially to the final performance.
5 Conclusion
In this paper, we propose ensemble prototypical networks for improving accuracy and robustness. Our ensemble model consists of eight modules built from basic neural networks. We adopt fine-tuning to enhance domain adaptation and introduce feature attention to alleviate the problem of feature sparsity. In our experiments, we evaluate the model on FewRel 1.0 and FewRel 2.0 and show that it significantly improves accuracy and robustness, achieving state-of-the-art results. In the future, we will explore more diverse ensemble schemes and adopt more neural encoders to make our model stronger.
References
 Chen et al. (2020) Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. 2020. A new metabaseline for fewshot learning. arXiv preprint arXiv:2003.04390.
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Computer Science.
 Devlin et al. (2018a) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
 Devlin et al. (2018b) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018b. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
 Dhillon et al. (2020) Guneet S. Dhillon, P. Chaudhari, A. Ravichandran, and Stefano Soatto. 2020. A baseline for fewshot image classification. ArXiv, abs/1909.02729.
 Elsken et al. (2020) Thomas Elsken, Benedikt Staffler, Jan Hendrik Metzen, and Frank Hutter. 2020. Meta-learning of neural architectures for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12365–12375.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org.
 Gao et al. (2019a) Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. 2019a. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6407–6414.
 Gao et al. (2019b) Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019b. FewRel 2.0: Towards more challenging few-shot relation classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6251–6256.
 Gao et al. (2019c) Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019c. FewRel 2.0: Towards more challenging fewshot relation classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 6250–6255, Hong Kong, China. Association for Computational Linguistics.
 Garcia and Bruna (2017) Victor Garcia and Joan Bruna. 2017. Fewshot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
 Han et al. (2018) Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. Fewrel: A largescale supervised fewshot relation classification dataset with stateoftheart evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4803–4809.
 Koch et al. (2015) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille.
 Lecun et al. (1989) Y. Lecun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551.
 Mishra et al. (2017) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2017. A simple neural attentive metalearner.
 Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. 2017. Meta networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 2554–2563. JMLR. org.
 Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACLHLT.
 Qu et al. (2020) Meng Qu, Tianyu Gao, LouisPascal Xhonneux, and Jian Tang. 2020. Fewshot relation extraction via bayesian metalearning on relation graphs. In International Conference on Machine Learning, pages 7867–7876. PMLR.
 Radford (2018) A. Radford. 2018. Improving language understanding by generative pretraining.
 Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for fewshot learning.
 Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for fewshot learning. In Advances in neural information processing systems, pages 4077–4087.
 Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. 2018. Learning to compare: Relation network for fewshot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208.
 Szegedy et al. (2015) Christian Szegedy, W. Liu, Y. Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, D. Erhan, V. Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9.
 Vanschoren (2018) Joaquin Vanschoren. 2018. Metalearning: A survey. arXiv preprint arXiv:1810.03548.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
 Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638.
 WeiYu et al. (2019) Chen WeiYu, YenCheng Liu, Zsolt Kira, YuChiang Wang, and JiaBin Huang. 2019. A closer look at fewshot classification. In International Conference on Learning Representations.
 Ye and Ling (2019) ZhiXiu Ye and ZhenHua Ling. 2019. Multilevel matching and aggregation network for fewshot relation classification. arXiv preprint arXiv:1906.06678.
 Yoon et al. (2018) Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. 2018. Bayesian modelagnostic metalearning. Advances in Neural Information Processing Systems, 31:7332–7342.
Appendices
5.1 Open Relation Extraction Datasets
Download: https://github.com/thunlp/FewRel
5.2 Computing Infrastructure
Computing infrastructure: GPU Tesla V100
5.3 Encoders Description
Encoder module
Given a sample $x$, represented by a sequence of word embeddings, we use different neural architectures as sentence encoders to obtain a continuous low-dimensional sample embedding $s$. We denote the encoder operation by the following equation.
(11) 
where the function on the right represents the neural network architecture used by the encoder operation.
CNN encoder: In this encoder module, we select a CNN to encode $x$ into an embedding $s$. In the CNN, convolution and pooling operations are applied successively to capture the text semantics and produce the sample embedding:
(12) 
where the convolution operation slides a convolution kernel over the word embeddings to obtain hidden embeddings, and the pooling operation uses max pooling to output the final sample embedding. We simplify the above operations into the following equation:

(13) 
Inception encoder: Referring to GoogLeNet, we design an Inception module with a wider convolution layer as the encoder, which uses multiple parallel convolution kernels with different window sizes to encode the sample and obtain hidden embeddings:

(14) 
Where means to use a convolution kernel with the window size i for convolution operation. In order to reduce the computational complexity of convolution operation with size convolution kernel, we decompose it into two size convolution kernels.
The features obtained from different scale convolution operations are fused to get the final hidden embeddings
(15) 
Here means to concatenate all embeddings () into a higherdimensional embeddings .
Finally, we get the sample embedding by applying a pooling operation on hidden embeddings
(16) 
We demote the above operation as the following equation:
(17) 
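The parallel multi-window branches, concatenation, and pooling can be sketched as below. This is a hedged NumPy sketch, assuming one filter bank per window size; it omits the kernel-decomposition trick and any nonlinearity.

```python
import numpy as np

def conv_branch(S, W, window):
    """One convolution branch over S (n, d) with a given window size."""
    n, d = S.shape
    lo, hi = (window - 1) // 2, window // 2
    Sp = np.pad(S, ((lo, hi), (0, 0)))     # pad so every branch outputs n rows
    return np.stack([Sp[t:t + window].reshape(-1) @ W for t in range(n)])

def inception_encode(S, filters):
    """Inception-style encoder sketch (Eqs. 14-17): parallel convolutions
    with different window sizes, concatenated, then max-pooled over time.
    filters: {window_size: (window*d, k_i) filter bank}."""
    H = np.concatenate([conv_branch(S, W, w) for w, W in filters.items()],
                       axis=1)             # (n, sum of k_i)
    return H.max(axis=0)

rng = np.random.default_rng(1)
S = rng.normal(size=(7, 16))
filters = {1: rng.normal(size=(16, 8)),    # window sizes 1, 3, 5 are illustrative
           3: rng.normal(size=(48, 8)),
           5: rng.normal(size=(80, 8))}
x = inception_encode(S, filters)
print(x.shape)
```

Because each branch pads to the same output length n, the branch outputs can be concatenated feature-wise before a single max-pooling step, which mirrors Eqs. (15)–(16).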
Attention-based GRU encoder:
In this encoder, we use a bidirectional recurrent neural network with self-attention to process the sample s. The encoder consists of a bidirectional GRU layer and a self-attention layer. The GRU layer uses a parameter-shared GRU cell to process the sample, computing each hidden state from the current input state and the previous output state:

h_t^f = GRU(w_t, h_{t-1}^f),   h_t^b = GRU(w_t, h_{t+1}^b)   (18)

We concatenate h_t^f and h_t^b to obtain the hidden embeddings h_t = [h_t^f; h_t^b], and denote all h_t as H. We then obtain the final sample embedding by a linear combination of the hidden embeddings in H. The self-attention layer is used to compute this linear combination; it takes the hidden embeddings in H as input and computes the weight vector a:

a = softmax(w2 tanh(W1 H^T))   (19)

where W1 is a linear layer and tanh is an activation function. The final representation x of the sample is the weighted sum of the hidden embeddings:

x = Σ_t a_t h_t   (20)

We denote the above operations as the following equation:

x = f_GRU(s)   (21)
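The bidirectional recurrence and the attention pooling can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: for brevity the forward and backward passes share one parameter set `P` (a real BiGRU uses separate parameters per direction), and all weight names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h, P):
    """Standard GRU cell: update gate z, reset gate r, candidate state."""
    z = sigmoid(P['Wz'] @ x + P['Uz'] @ h)
    r = sigmoid(P['Wr'] @ x + P['Ur'] @ h)
    hh = np.tanh(P['Wh'] @ x + P['Uh'] @ (r * h))
    return (1 - z) * h + z * hh

def bigru_attention_encode(S, P, W1, w2):
    """BiGRU + self-attention pooling sketch (Eqs. 18-21)."""
    n, _ = S.shape
    dh = P['Uz'].shape[0]
    hf, hb = np.zeros(dh), np.zeros(dh)
    fwd, bwd = [], []
    for t in range(n):                       # forward pass over the tokens
        hf = gru_cell(S[t], hf, P); fwd.append(hf)
    for t in reversed(range(n)):             # backward pass (shared params here)
        hb = gru_cell(S[t], hb, P); bwd.append(hb)
    H = np.stack([np.concatenate([f, b])     # h_t = [h_t^f; h_t^b]
                  for f, b in zip(fwd, reversed(bwd))])
    scores = w2 @ np.tanh(W1 @ H.T)          # Eq. 19: attention scores
    a = np.exp(scores - scores.max()); a /= a.sum()
    return H.T @ a                           # Eq. 20: weighted sum of states

rng = np.random.default_rng(2)
d, dh, n = 16, 8, 7
P = {k: rng.normal(size=(dh, d)) * 0.1 for k in ('Wz', 'Wr', 'Wh')}
P.update({k: rng.normal(size=(dh, dh)) * 0.1 for k in ('Uz', 'Ur', 'Uh')})
S = rng.normal(size=(n, d))
W1, w2 = rng.normal(size=(dh, 2 * dh)), rng.normal(size=dh)
x = bigru_attention_encode(S, P, W1, w2)
print(x.shape)
```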
Attention-based Transformer encoder: The Transformer is a parallel neural network architecture built on the attention mechanism, which consists of two parts: an encoder and a decoder. Here we use the Transformer encoder as our sentence encoder:

H = TransEnc(s)   (22)

The Transformer encoder takes the word embeddings of the sample s as input and outputs the hidden embeddings H. We then append a self-attention layer, as in the GRU encoder, to compute the linear combination of the hidden embeddings in H and obtain the final sample embedding:

x = Σ_t a_t h_t   (23)

We denote the above operations as the following equation:

x = f_Trans(s)   (24)
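A stripped-down version of this encoder can be sketched as below. This NumPy sketch is an assumption-laden illustration, not the paper's implementation: it uses a single attention head and omits multi-head projection, residual connections, layer normalization, the feed-forward sublayer, and positional encodings.

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def transformer_encode(S, Wq, Wk, Wv, W1, w2):
    """Single-head Transformer-encoder sketch (Eqs. 22-24): scaled
    dot-product self-attention yields hidden states H, then the same
    attention pooling as the GRU encoder yields the sample embedding x."""
    Q, K, V = S @ Wq, S @ Wk, S @ Wv
    A = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))  # (n, n) token attention
    H = A @ V                                        # Eq. 22: hidden embeddings
    scores = w2 @ np.tanh(W1 @ H.T)                  # Eq. 23: pooling weights
    a = np.exp(scores - scores.max()); a /= a.sum()
    return H.T @ a                                   # sample embedding x

rng = np.random.default_rng(3)
n, d, dk = 7, 16, 8
S = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, dk)) for _ in range(3))
W1, w2 = rng.normal(size=(dk, dk)), rng.normal(size=dk)
x = transformer_encode(S, Wq, Wk, Wv, W1, w2)
print(x.shape)
```

The design point the section makes is that the final pooling step is shared across the GRU and Transformer encoders, so only the token-level contextualization (recurrence vs. self-attention) differs.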
5.4 Other Experimental Results
Experimental results on the cross-domain dataset FewRel 2.0.
Model  5-way 1-shot  5-way 5-shot  5-way 10-shot  10-way 1-shot  10-way 5-shot  10-way 10-shot 
Cnn_edis  41.54  55.87  62.28  29.69  42.33  48.62 
Cnn_cosine  40.57  51.95  57.45  28.56  38.13  43.05 
Incep_edis  39.70  55.11  61.81  27.82  41.57  48.10 
Incep_cosine  39.70  53.69  60.07  27.93  40.14  46.16 
GRU_edis  40.41  53.73  58.99  28.28  40.07  45.05 
GRU_cosine  39.63  52.55  57.68  27.79  39.01  43.99 
Trans_edis  41.15  54.98  60.43  29.14  41.40  46.89 
Trans_cosine  40.16  53.61  59.22  28.24  40.19  45.65 
Ensemble  44.42  61.76  68.49  32.44  48.26  55.23 
Model  5-way 1-shot  5-way 5-shot  5-way 10-shot  10-way 1-shot  10-way 5-shot  10-way 10-shot 
Ensemble_cosine  42.67  57.99  64.45  30.55  44.06  50.43 
Ensemble_edis  43.24  59.70  66.20  31.13  45.95  52.58 
Ensemble  44.39  61.27  67.89  32.20  47.61  54.41 
Model  5-way 1-shot  5-way 5-shot  5-way 10-shot  10-way 1-shot  10-way 5-shot  10-way 10-shot 
Cnn_edis  /  58.55  66.28  /  48.01  56.05 
Cnn_cosine  /  61.28  68.94  /  53.17  62.94 
Incep_edis  /  63.10  70.26  /  57.05  65.52 
Incep_cosine  /  60.02  67.91  /  51.36  60.14 
GRU_edis  /  57.91  63.03  /  49.53  56.32 
GRU_cosine  /  56.43  61.90  /  47.97  54.36 
Trans_edis  /  63.35  69.88  /  57.93  63.91 
Trans_cosine  /  62.93  68.49  /  57.85  65.15 
Ensemble_finetune  /  68.32  74.96  /  63.98  70.61 
Comparative experiment in which the model uses GloVe and BERT word vectors, respectively, as the initial word embeddings on the randomly split in-domain FewRel 1.0. The accuracy of the model fluctuates greatly when BERT word vectors are used, so this is only a preliminary comparison.
Model  GloVe  BERT 
Cnn_edis  89.81  92.78 
Cnn_cosine  89.28  91.93 
Incep_edis  91.01  93.65 
Incep_cosine  90.62  92.75 
GRU_edis  90.49  91.84 
GRU_cosine  89.79  90.63 
Trans_edis  90.93  92.06 
Trans_cosine  90.62  91.712 
Ensemble  93.18  94.71 
Bert-pair  /  94.19 