1 Introduction
Relation classification (RC) is a fundamental task in natural language processing (NLP), which aims to identify the semantic relation between two entities in text. For example, the instance “[London] is the capital of [the UK]” expresses the relation capital_of between the two entities London and the UK.
Table 1: An example of a support set and a query instance.
|Support set|
|class A: mother|
|instance #1 The Queen Consort [Jetsun Pema] gave birth to a son on 5 February 2016 , [Jigme Namgyel Wangchuck].|
|instance #2 He married the American actress [Cindy Robbins] and was stepfather to her daughter , [Kimberly Beck].|
|instance #3 Edgar married actress [Moyna Macgill] and became the father of [Angela Lansbury].|
|instance #4 In 1845 , [Cemile Sultan] ’s mother , Empress [Düzdidil Kadın], died.|
|instance #5 Bo ’s wife [Gu Kailai] traveled with their son [Bo Guagua] to Britain.|
|class B: member_of …|
|class C: father …|
|class D: sport …|
|class E: voice_type …|
|Query instance|
|He was married to [Eva Funck] and they have a son [Gustav] .|
Some conventional relation classification methods (Bethard and Martin, 2007; Zelenko et al., 2002) adopted supervised training and suffered from the lack of large-scale manually labeled data. To address this issue, the distant supervision method (Mintz et al., 2009) was proposed, which annotated training data by heuristically aligning knowledge bases (KBs) and texts. However, the long-tail problem in KBs (Xiong et al., 2018; Han et al., 2018) still exists and makes it hard to classify the relations with very few training samples.
This paper focuses on the few-shot relation classification task, which was designed to address the long-tail problem. In this task, only a few (e.g., 1 or 5) support instances are given for each relation, as shown by an example in Table 1.
The few-shot learning problem has been studied extensively in the computer vision (CV) field. Some methods adopt meta-learning architectures (Santoro et al., 2016; Ravi and Larochelle, 2016; Finn et al., 2017; Munkhdalai and Yu, 2017), which learn fast-learning abilities from previous experiences (e.g., training set) and then rapidly generalize to new concepts (e.g., test set). Some other methods use metric learning based networks (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017), which learn the distance distributions among classes. A simple and effective metric-based few-shot learning method is the prototypical network (Snell et al., 2017). In a prototypical network, query and support instances are encoded into an embedding space independently. Then, a prototype vector for each candidate class is derived as the mean of its support instances in the embedding space. Finally, classification is performed by calculating the distances between the embedding vector of the query and all class prototypes. The prototypical network has also been applied to few-shot relation classification recently (Han et al., 2018).
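The prototypical-network procedure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the embeddings are taken as given (no encoder is trained), the function name `prototypical_predict` is hypothetical, and the toy data are made up.

```python
import numpy as np

def prototypical_predict(support, query):
    """Classify one query embedding against N classes.

    support: array of shape (N, K, d), K support embeddings per class.
    query:   array of shape (d,).
    Returns the index of the nearest prototype and all N distances.
    """
    prototypes = support.mean(axis=1)                    # (N, d): per-class means
    dists = np.linalg.norm(prototypes - query, axis=1)   # (N,): Euclidean distances
    return int(np.argmin(dists)), dists

# Toy 3-way 5-shot episode with 2-dimensional embeddings (made-up data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
support = centers[:, None, :] + 0.1 * rng.standard_normal((3, 5, 2))
pred, dists = prototypical_predict(support, np.array([4.8, 5.1]))
print(pred, dists.round(2))
```

Note that the prototypes here do not depend on the query at all, which is exactly the property the model proposed in this paper changes.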
This paper proposes a multi-level matching and aggregation network (MLMAN) for few-shot relation classification. Different from prototypical networks, which represent support sets without dependency on query instances, our proposed MLMAN model encodes each query instance and each support set in an interactive way by considering their matching information at both local and instance levels. At the local level, the local context representations of a query instance and a support set are softly matched toward each other following the sentence matching framework (Chen et al., 2017). Then, the matched local representations are aggregated into an embedding vector for each query and each support instance using max and average pooling. At the instance level, the matching degree between the query instance and each of the support instances is calculated via a multi-layer perceptron (MLP). Taking the matching degrees as weights, the instances in a support set are aggregated to form the class prototype for final classification. All these matching and aggregation layers in the MLMAN model are estimated jointly using training data. Since the representations of the support instances in each class are expected to be close to each other, an auxiliary loss function is further designed to measure the inconsistency among all support representations in each class.
In summary, our contributions in this paper are three-fold. First, a multi-level matching and aggregation network is proposed to encode query instances and class prototypes in an interactive fashion. Second, an auxiliary loss function measuring the consistency among support instances is designed. Third, our method achieves a new state-of-the-art performance on FewRel, a public few-shot relation classification dataset.
2 Related Work
2.1 Relation Classification
Relation classification aims to identify the semantic relation between two entities in one sentence. In recent years, neural networks have been widely applied to this task. Zeng et al. (2014) employed position features and convolutional neural networks (CNNs) to capture the structure and contextual information respectively. Then, a max pooling operation was adopted to determine the most useful features. Wang et al. (2016) proposed multi-level attention CNNs, which captured both entity-specific attention and relation-specific pooling attention in order to better discern patterns in heterogeneous contexts. Zhou et al. (2016) proposed attention-based bidirectional long short-term memory networks (AttBLSTMs) to capture the most important semantic information in a sentence. All of these methods require a large amount of training data and cannot quickly adapt to a new class that has never been seen.
2.2 Metric Based Few-Shot Learning
In the few-shot learning paradigm, a classifier is required to generalize to new classes with only a small number of training samples. The metric based approach aims to learn a set of projection functions that take support and query samples from the target problem and classify them in a feed-forward manner. This approach has lower complexity and is easier to implement than the meta-learner based approach (Ravi and Larochelle, 2016; Finn et al., 2017; Santoro et al., 2016; Munkhdalai and Yu, 2017).
Some metric based few-shot learning methods have been developed for computer vision (CV) tasks, and all these methods encode each support or query image into a vector independently for classification. Koch et al. (2015) proposed a method for learning siamese neural networks, which employed a single shared structure to encode both support and query samples and one more layer to compute the induced distance metric between the pair. Vinyals et al. (2016) proposed to learn a matching network augmented with attention and external memories. They also proposed an episode-based training procedure, which was based on the principle that test and training conditions must match and has been adopted by many subsequent studies. Snell et al. (2017) proposed prototypical networks that learn a metric space in which classification can be performed by computing distances to prototype representations of all classes, where the prototype representation of each class is the mean of all its support samples. Garcia and Bruna (2017) defined a graph neural network architecture to assimilate generic message-passing inference algorithms, which generalized the above three models.
Regarding few-shot relation classification, Han et al. (2018) adopted prototypical networks to build baseline models on the FewRel dataset. Gao et al. (2019) proposed hybrid attention-based prototypical networks to handle noisy training samples in few-shot learning. In this paper, we improve the conventional prototypical networks for few-shot relation classification by encoding the query instance and class prototype interactively through multi-level matching and aggregation.
2.3 Sentence Matching
Sentence matching is essential for many NLP tasks, such as natural language inference (NLI) (Bowman et al., 2015) and response selection (Lowe et al., 2015). Some sentence matching methods mainly rely on sentence encoding (Mueller and Thyagarajan, 2016; Conneau et al., 2017; Chen et al., 2018), which encode a pair of sentences independently and then pass their embeddings to a classifier, such as a neural network, to decide the relationship between them. Some other methods are based on joint models (Chen et al., 2017; Gong et al., 2017; Kim et al., 2018), which use cross-features to represent the local (i.e., word-level and phrase-level) alignments for better performance. In this paper, we follow the joint models to achieve the local matching between a query instance and the support set for a class. The difference between our task and the other sentence matching tasks mentioned above is that our goal is to match a sentence to a set of sentences, instead of to another sentence (Bowman et al., 2015) or to a sequence of sentences (Lowe et al., 2015).
3 Task Definition
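In few-shot relation classification, we are given two datasets, $\mathcal{D}_{meta\text{-}train}$ and $\mathcal{D}_{meta\text{-}test}$. Each dataset consists of a set of samples $(x, p, r)$, where $x$ is a sentence composed of $T$ words and the $t$-th word is $w_t$, $p = (p_1, p_2)$ indicates the positions of two entities, and $r$ is the relation label of the instance $(x, p)$. These two datasets have their own relation label spaces that are disjoint with each other. Under the few-shot configuration, $\mathcal{D}_{meta\text{-}test}$ is split into two parts, $\mathcal{D}_{test\text{-}support}$ and $\mathcal{D}_{test\text{-}query}$. If $\mathcal{D}_{test\text{-}support}$ contains $K$ labeled samples for each of $N$ relation classes, this target few-shot problem is named $N$-way-$K$-shot. $\mathcal{D}_{test\text{-}query}$ contains test samples, each labeled with one of the $N$ classes. Assuming that we only have $\mathcal{D}_{test\text{-}support}$ and $\mathcal{D}_{test\text{-}query}$, we can train a model using $\mathcal{D}_{test\text{-}support}$ and evaluate its performance on $\mathcal{D}_{test\text{-}query}$. But limited by the number of support samples (i.e., $N \times K$), it is hard to train a good model from scratch.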
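Although $\mathcal{D}_{meta\text{-}train}$ and $\mathcal{D}_{meta\text{-}test}$ have disjoint relation label spaces, $\mathcal{D}_{meta\text{-}train}$ can also be utilized to help the few-shot relation classification on $\mathcal{D}_{meta\text{-}test}$. One approach is the paradigm proposed by Vinyals et al. (2016), which obeys an important machine learning principle that test and train conditions must match. That is to say, we also split $\mathcal{D}_{meta\text{-}train}$ into two parts, $\mathcal{D}_{train\text{-}support}$ and $\mathcal{D}_{train\text{-}query}$, and mimic the few-shot learning settings at the training stage. In each training iteration, $N$ classes are randomly selected from $\mathcal{D}_{train\text{-}support}$, and $K$ support instances are randomly selected from each class. In this way, we construct the train-support set $\mathcal{S} = \{s_k^i;\ i = 1, \dots, N,\ k = 1, \dots, K\}$, where $s_k^i$ is the $k$-th instance in class $i$. We also randomly select $R$ samples from the remaining samples of those $N$ classes and construct the train-query set $\mathcal{Q} = \{(q_j, l_j);\ j = 1, \dots, R\}$, where $l_j \in \{1, \dots, N\}$ is the label of instance $q_j$.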
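The episode construction described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `sample_episode`, the dictionary layout of the data, and the toy relation names are all assumptions made for the example.

```python
import random

def sample_episode(data, n_way, k_shot, n_query, rng):
    """Build one training episode.

    data: dict mapping a relation name to its list of instances (assumed layout).
    Returns (support, query): support[i] holds the K instances of class i,
    query is a list of (instance, class_index) pairs drawn from the
    remaining instances of the same N classes.
    """
    classes = rng.sample(sorted(data), n_way)              # pick N relations
    support, pool = [], []
    for label, rel in enumerate(classes):
        shuffled = rng.sample(data[rel], len(data[rel]))   # shuffled copy
        support.append(shuffled[:k_shot])                  # K support instances
        pool.extend((inst, label) for inst in shuffled[k_shot:])
    query = rng.sample(pool, n_query)                      # R query instances
    return support, query

# Toy data: 8 relations with 10 sentences each (made up).
toy = {f"rel{r}": [f"rel{r}-sent{j}" for j in range(10)] for r in range(8)}
support, query = sample_episode(toy, n_way=5, k_shot=2, n_query=5,
                                rng=random.Random(0))
```

Because the query pool excludes the sampled support instances, no sentence appears on both sides of an episode.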
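Just like conventional prototypical networks, we expect to minimize the following objective function at training time
$$J_{match} = -\frac{1}{R} \sum_{(q, l) \in \mathcal{Q}} \log P(l \mid \mathcal{S}, q), \tag{1}$$
and $P(l \mid \mathcal{S}, q)$ is defined as
$$P(l \mid \mathcal{S}, q) = \frac{\exp\left(f(\{s_k^l\}_{k=1}^{K}, q)\right)}{\sum_{i=1}^{N} \exp\left(f(\{s_k^i\}_{k=1}^{K}, q)\right)}. \tag{2}$$
The function $f(\{s_k^i\}_{k=1}^{K}, q)$ calculates the matching degree between the query instance $q$ and the set of support instances $\{s_k^i\}_{k=1}^{K}$. How to design this function is the focus of this paper.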
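As a small numerical illustration of the training objective, the sketch below turns matching scores into class probabilities with a softmax and averages the negative log-probability of the correct class. It assumes the objective takes the usual prototypical-network form (mean negative log-probability); the function names and the toy scores are hypothetical.

```python
import numpy as np

def class_probabilities(scores):
    """Softmax over the N matching scores of one query instance."""
    shifted = scores - scores.max()          # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def match_loss(score_matrix, labels):
    """Mean negative log-probability of the true class.

    score_matrix: (R, N) matching scores, one row per query instance.
    labels:       (R,) integer class indices in [0, N).
    """
    shifted = score_matrix - score_matrix.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Two query instances scored against three classes (made-up scores).
scores = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
probs = class_probabilities(scores[0])
loss = match_loss(scores, labels)
```

When all scores are equal the loss reduces to log N, which is a convenient sanity check for an untrained model.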
4 Methodology
In this section, we will introduce our proposed multi-level matching and aggregation network (MLMAN) for modeling $f(\{s_k^i\}_{k=1}^{K}, q)$. For simplicity, we will discard the superscript $i$ of $s_k^i$ from Section 4.1 to Section 4.4. The framework of our proposed MLMAN model is shown in Fig. 1, which has four main modules.
Context Encoder. Given a sentence and the positions of two entities within this sentence, CNNs (Zeng et al., 2014) are adopted to derive the local context representations of each word in the sentence.
Local Matching and Aggregation. Similar to Chen et al. (2017), given the local representation of a query instance and the local representations of support instances, the attention method is employed to collect local matching information between them. Then, the matched local representations are aggregated to represent each instance as an embedding vector.
Instance Matching and Aggregation. The matching information between a query instance and each of the support instances is calculated using an MLP. Then, we take the matching degrees as weights to sum the representations of support instances in order to get the class prototype.
Class Matching. An MLP is built to calculate the matching score between the representations of the query instance and the class prototype.
More details of these four modules will be introduced in the following subsections.
4.1 Context Encoder
For a query or support instance, each word $w_t$ in the sentence $x$ is first mapped into a $d_w$-dimensional word embedding (Pennington et al., 2014). In order to describe the position information of the two entities in this instance, the position features (PFs) proposed by Zeng et al. (2014) are also adopted in our work. Here, PFs describe the relative distances between the current word and the two entities, and are further mapped into two vectors $\mathbf{p}_{1t}$ and $\mathbf{p}_{2t}$ of $d_p$ dimensions. Finally, these three vectors are concatenated to get the word representation $\mathbf{w}_t$ of $d_i = d_w + 2d_p$ dimensions, and the instance can be written as $\mathbf{W} \in \mathbb{R}^{T \times d_i}$.
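The input representation can be sketched as below: relative distances to each entity are clipped, shifted to non-negative indices, looked up in two position embedding tables, and concatenated with the word embedding. The maximum relative distance of 40 follows the hyperparameter table; everything else (function names, a position embedding size of 5, the vocabulary size, using a single token index per entity, random embedding tables) is an assumption for illustration only.

```python
import numpy as np

MAX_DIST = 40  # max relative distance, as listed in the hyperparameter table

def relative_positions(length, entity_index, max_dist=MAX_DIST):
    """Relative distance of every token to one entity, clipped to
    [-max_dist, max_dist] and shifted so it can index an embedding table."""
    rel = np.arange(length) - entity_index
    return np.clip(rel, -max_dist, max_dist) + max_dist

def word_representations(token_ids, head_idx, tail_idx, word_emb, pos_emb1, pos_emb2):
    """Concatenate word embedding and two position embeddings per token."""
    p1 = relative_positions(len(token_ids), head_idx)
    p2 = relative_positions(len(token_ids), tail_idx)
    return np.concatenate([word_emb[token_ids], pos_emb1[p1], pos_emb2[p2]], axis=1)

rng = np.random.default_rng(0)
d_w, d_p = 50, 5                                  # d_p = 5 is an assumed value
word_emb = rng.standard_normal((1000, d_w))       # stand-in for pretrained vectors
pos_emb1 = rng.standard_normal((2 * MAX_DIST + 1, d_p))
pos_emb2 = rng.standard_normal((2 * MAX_DIST + 1, d_p))
token_ids = np.array([12, 7, 256, 3, 42, 9])      # made-up token ids
W = word_representations(token_ids, head_idx=1, tail_idx=4,
                         word_emb=word_emb, pos_emb1=pos_emb1, pos_emb2=pos_emb2)
```

Here `W` has one row per token and `d_w + 2 * d_p` columns.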
The most popular models for local context encoding are recurrent neural networks (RNNs) with long short-term memories (LSTMs) (Hochreiter and Schmidhuber, 1997) and convolutional neural networks (CNNs) (Kim, 2014). In this paper, we employ CNNs to build the context encoder. For an input instance $\mathbf{W} \in \mathbb{R}^{T \times d_i}$, we input it into a CNN with $d_c$ filters. The output from the CNN is a matrix $\mathbf{H}$ with $T \times d_c$ dimensions. In this way, the context representations of the query instance $\mathbf{Q} \in \mathbb{R}^{T_q \times d_c}$ and the context representations of support instances $\{\mathbf{S}_k \in \mathbb{R}^{T_k \times d_c};\ k = 1, \dots, K\}$ are obtained, where $T_q$ and $T_k$ are the sentence lengths of the query sentence and the $k$-th support sentence respectively.
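A length-preserving 1-D convolution of the kind described above can be sketched in NumPy as follows. The window size of 3, zero padding, the absence of a nonlinearity, and the filter count used in the example are assumptions, since the text only states the input and output shapes.

```python
import numpy as np

def cnn_context_encoder(W, filters, bias):
    """Apply a 1-D convolution over the token axis with zero padding.

    W:       (T, d_i) word representations of one instance.
    filters: (window, d_i, d_c) convolution kernels (odd window assumed).
    bias:    (d_c,)
    Returns H of shape (T, d_c), one context vector per token.
    """
    window = filters.shape[0]
    pad = window // 2
    padded = np.pad(W, ((pad, pad), (0, 0)))      # zero-pad the token axis only
    return np.stack([
        np.einsum("wi,wic->c", padded[t:t + window], filters) + bias
        for t in range(W.shape[0])
    ])

rng = np.random.default_rng(0)
T, d_i, d_c = 7, 60, 200                          # d_c = 200 is an assumed value
H = cnn_context_encoder(rng.standard_normal((T, d_i)),
                        0.01 * rng.standard_normal((3, d_i, d_c)),
                        np.zeros(d_c))
```

Keeping the output length equal to the sentence length matters here, because the local matching step that follows attends over individual token positions.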
4.2 Local Matching and Aggregation
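In order to get the matching information between $\mathbf{Q}$ and $\{\mathbf{S}_k\}_{k=1}^{K}$, we first concatenate the support instance representations into one matrix as follows
$$\mathbf{C} = \mathrm{concat}(\{\mathbf{S}_k\}_{k=1}^{K}), \tag{3}$$
where $\mathbf{C} \in \mathbb{R}^{T_s \times d_c}$ with $T_s = \sum_{k=1}^{K} T_k$. Then, we collect the matching information between $\mathbf{Q}$ and $\mathbf{C}$ and calculate their matched representations $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{C}}$ as follows
$$\alpha_{mn} = \mathbf{q}_m^{\top} \mathbf{c}_n, \tag{4}$$
$$\tilde{\mathbf{q}}_m = \sum_{n=1}^{T_s} \frac{\exp(\alpha_{mn})}{\sum_{n'=1}^{T_s} \exp(\alpha_{mn'})} \mathbf{c}_n, \tag{5}$$
$$\tilde{\mathbf{c}}_n = \sum_{m=1}^{T_q} \frac{\exp(\alpha_{mn})}{\sum_{m'=1}^{T_q} \exp(\alpha_{m'n})} \mathbf{q}_m, \tag{6}$$
where $\mathbf{q}_m$ and $\mathbf{c}_n$ are the $m$-th row of $\mathbf{Q}$ and the $n$-th row of $\mathbf{C}$ respectively.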
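Next, the original representations and the matched representations are fused utilizing a ReLU layer as follows,
$$\bar{\mathbf{Q}} = \mathrm{ReLU}\left([\mathbf{Q}; \tilde{\mathbf{Q}}; |\mathbf{Q} - \tilde{\mathbf{Q}}|; \mathbf{Q} \odot \tilde{\mathbf{Q}}]\, \mathbf{W}_1\right), \tag{7}$$
$$\bar{\mathbf{C}} = \mathrm{ReLU}\left([\mathbf{C}; \tilde{\mathbf{C}}; |\mathbf{C} - \tilde{\mathbf{C}}|; \mathbf{C} \odot \tilde{\mathbf{C}}]\, \mathbf{W}_1\right), \tag{8}$$
where $\odot$ is the element-wise product and $\mathbf{W}_1 \in \mathbb{R}^{4d_c \times d_h}$ is the weight matrix at this layer for reducing dimensionality. $\bar{\mathbf{C}}$ is further split into $K$ representations $\{\bar{\mathbf{S}}_k\}_{k=1}^{K}$ corresponding to the $K$ support instances, where $\bar{\mathbf{S}}_k \in \mathbb{R}^{T_k \times d_h}$. All $\bar{\mathbf{S}}_k$ and $\bar{\mathbf{Q}}$ are fed into a single-layer bidirectional LSTM (BLSTM) with $d_h$ hidden units along each direction to obtain the final local matching results $\hat{\mathbf{S}}_k \in \mathbb{R}^{T_k \times 2d_h}$ and $\hat{\mathbf{Q}} \in \mathbb{R}^{T_q \times 2d_h}$.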
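The local matching and fusion steps can be sketched as follows: dot-product attention in both directions between the query tokens and the concatenated support tokens, followed by a ReLU projection over concatenated features. This is a sketch under assumptions, not the authors' code: the exact feature set in `fuse` (original, matched, absolute difference, element-wise product) follows common practice in ESIM-style sentence matching and is assumed here, the BLSTM that follows is omitted, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_match(Q, C):
    """Q: (T_q, d) query tokens; C: (T_s, d) concatenated support tokens.
    Returns the matched representations (Q_tilde, C_tilde)."""
    alpha = Q @ C.T                           # (T_q, T_s) dot-product scores
    Q_tilde = softmax(alpha, axis=1) @ C      # each query token attends over C
    C_tilde = softmax(alpha, axis=0).T @ Q    # each support token attends over Q
    return Q_tilde, C_tilde

def fuse(X, X_tilde, W1):
    """Combine original and matched features and project with a ReLU layer."""
    feats = np.concatenate([X, X_tilde, np.abs(X - X_tilde), X * X_tilde], axis=1)
    return np.maximum(feats @ W1, 0.0)

rng = np.random.default_rng(0)
d_c, d_h = 8, 4                               # toy sizes
Q = rng.standard_normal((6, d_c))
S = [rng.standard_normal((T_k, d_c)) for T_k in (5, 7, 4)]   # K = 3 support instances
C = np.concatenate(S, axis=0)                 # concatenation along the token axis
Q_tilde, C_tilde = local_match(Q, C)
W1 = rng.standard_normal((4 * d_c, d_h))
Q_bar, C_bar = fuse(Q, Q_tilde, W1), fuse(C, C_tilde, W1)
# Split the fused support matrix back into one matrix per support instance.
S_bar = np.split(C_bar, np.cumsum([len(s) for s in S])[:-1])
```

Because the attention is normalized over all tokens of all support instances at once, a support sentence that matches the query poorly receives little attention mass from the query side.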
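Local aggregation aims to convert the results of local matching into a single vector for each query and each support instance. In this paper, we employ max pooling together with average pooling, and concatenate their results into one vector $\hat{\mathbf{s}}_k$ or $\hat{\mathbf{q}}$. The calculations are as follows,
$$\hat{\mathbf{s}}_k = [\max(\hat{\mathbf{S}}_k); \mathrm{ave}(\hat{\mathbf{S}}_k)], \tag{9}$$
$$\hat{\mathbf{q}} = [\max(\hat{\mathbf{Q}}); \mathrm{ave}(\hat{\mathbf{Q}})], \tag{10}$$
where both pooling operations are taken over the token axis.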
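The pooling step is simple enough to check by hand. A minimal sketch, with a hypothetical function name and a tiny made-up matrix standing in for the BLSTM output:

```python
import numpy as np

def aggregate(H):
    """H: (T, d) token-level representations of one instance.
    Returns the concatenation of max pooling and average pooling
    over the token axis, a vector of length 2 * d."""
    return np.concatenate([H.max(axis=0), H.mean(axis=0)])

H_toy = np.array([[1.0, 4.0],
                  [3.0, 2.0]])
pooled = aggregate(H_toy)   # max -> [3, 4], mean -> [2, 3]
```

The result has a fixed length regardless of the sentence length, which is what allows instances of different lengths to be compared in the later matching steps.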
4.3 Instance Matching and Aggregation
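Similar to conventional prototypical networks (Snell et al., 2017), our proposed method calculates the class prototype $\hat{\mathbf{s}}$ via the representations of all support instances in this class, i.e., $\{\hat{\mathbf{s}}_k\}_{k=1}^{K}$. However, instead of using a naive mean operation, we aggregate instance-level representations via attention over $\{\hat{\mathbf{s}}_k\}_{k=1}^{K}$, where each weight is derived from the instance matching score between $\hat{\mathbf{s}}_k$ and $\hat{\mathbf{q}}$. The matching function is as follows,
$$\beta_k = \mathbf{v}^{\top} \mathrm{ReLU}\left(\mathbf{W}_2 [\hat{\mathbf{s}}_k; \hat{\mathbf{q}}]\right), \tag{11}$$
where $\mathbf{W}_2 \in \mathbb{R}^{d_h \times 8d_h}$ and $\mathbf{v} \in \mathbb{R}^{d_h}$. $\beta_k$ describes the instance-level matching degree between the query instance $q$ and the support instance $s_k$. Then, all $\hat{\mathbf{s}}_k$ are aggregated into one vector $\hat{\mathbf{s}}$ as
$$\hat{\mathbf{s}} = \sum_{k=1}^{K} \frac{\exp(\beta_k)}{\sum_{k'=1}^{K} \exp(\beta_{k'})} \hat{\mathbf{s}}_k, \tag{12}$$
and $\hat{\mathbf{s}}$ is the class prototype.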
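The query-dependent prototype can be sketched as below: an MLP scores each support vector against the query vector, the scores are normalized with a softmax, and the prototype is the weighted sum of the support vectors. The one-hidden-layer form of the MLP, the function names, and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def matching_score(a, b, W2, v):
    """One-hidden-layer MLP score for a pair of vectors (assumed form)."""
    hidden = np.maximum(W2 @ np.concatenate([a, b]), 0.0)   # ReLU layer
    return float(v @ hidden)

def class_prototype(S_hat, q_hat, W2, v):
    """S_hat: (K, d) support vectors of one class; q_hat: (d,) query vector.
    Returns the attention-weighted prototype and the K weights."""
    betas = np.array([matching_score(s_k, q_hat, W2, v) for s_k in S_hat])
    weights = np.exp(betas - betas.max())    # softmax over the K instances
    weights /= weights.sum()
    return weights @ S_hat, weights

rng = np.random.default_rng(0)
K, d, d_hid = 5, 6, 4                        # toy sizes
S_hat = rng.standard_normal((K, d))
q_hat = rng.standard_normal(d)
W2 = rng.standard_normal((d_hid, 2 * d))
v = rng.standard_normal(d_hid)
prototype, weights = class_prototype(S_hat, q_hat, W2, v)
```

If all weights were forced to 1/K this would reduce to the mean prototype of a conventional prototypical network.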
4.4 Class Matching
After the class prototype $\hat{\mathbf{s}}$ and the embedding vector $\hat{\mathbf{q}}$ of the query instance have been determined, the class-level matching function in Eq. (2) is defined as
$$f(\{s_k\}_{k=1}^{K}, q) = \mathbf{v}^{\top} \mathrm{ReLU}\left(\mathbf{W}_2 [\hat{\mathbf{s}}; \hat{\mathbf{q}}]\right). \tag{13}$$
Eq. (11) and (13) have the same form. In our experiments, sharing the weights $\mathbf{W}_2$ and $\mathbf{v}$ in these two equations, i.e., employing exactly the same function for both instance-level and class-level matching in each training iteration, leads to better performance.
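Class-level matching can be sketched as scoring the query against every class prototype with the same MLP parameters that are used for instance-level matching, then picking the highest score. The MLP form, the function names, and the hand-picked toy parameters are assumptions. Since local matching is performed against each class's support set, the query embedding is passed per class here.

```python
import numpy as np

def mlp_score(a, b, W2, v):
    """Shared matching function (assumed one-hidden-layer ReLU MLP)."""
    return float(v @ np.maximum(W2 @ np.concatenate([a, b]), 0.0))

def classify(prototypes, q_hats, W2, v):
    """prototypes: (N, d) class prototypes.
    q_hats:     (N, d) query embedding computed against each class.
    Returns the predicted class index and the N matching scores."""
    scores = np.array([mlp_score(s, q, W2, v) for s, q in zip(prototypes, q_hats)])
    return int(np.argmax(scores)), scores

# Hand-picked toy parameters so the scores are easy to verify:
# score = ReLU(sum(prototype) + sum(query)).
W2 = np.ones((1, 4))
v = np.array([1.0])
prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
q_hats = np.zeros((3, 2))
pred, scores = classify(prototypes, q_hats, W2, v)
```

Reusing one `mlp_score` for both levels means a gradient from the class-level loss also updates the function that produces the instance-level attention weights.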
4.5 Joint Training with Inconsistency Measurement
If the representations of all support instances in a class are far away from each other, it could become difficult for the derived class prototype to capture the common characteristics of all support instances. Therefore, a function which measures the inconsistency among the set of support instances is designed. In order to avoid the high complexity of directly comparing every two support instances in a class, we calculate the inconsistency measurement as the average Euclidean distance between the support instances and the class prototype as
$$J_{incon} = \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \left\| \hat{\mathbf{s}}_k^i - \hat{\mathbf{s}}^i \right\|_2, \tag{14}$$
where $i$ is the class index and $\|\cdot\|_2$ calculates the 2-norm of a vector. The final objective function for training the whole model combines the matching objective and the inconsistency measurement as
$$J = J_{match} + \lambda J_{incon}, \tag{15}$$
where $\lambda$ is a hyper-parameter and was set as 1 in our experiments without any tuning.
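The auxiliary term can be sketched as the mean distance between each support vector and its class prototype, added to the matching loss with a weight. The sketch uses the unsquared Euclidean distance, following the wording of the text; the function names and the toy data are assumptions.

```python
import numpy as np

def inconsistency(S_hat, prototypes):
    """S_hat: (N, K, d) support vectors; prototypes: (N, d).
    Mean Euclidean distance between each support vector and its prototype."""
    dists = np.linalg.norm(S_hat - prototypes[:, None, :], axis=-1)   # (N, K)
    return float(dists.mean())

def joint_loss(j_match, j_incon, lam=1.0):
    """Weighted sum of the matching loss and the inconsistency term."""
    return j_match + lam * j_incon

# One class with two support vectors; the mean prototype is (3, 4),
# so each support vector lies at distance 5 from it.
S_toy = np.array([[[0.0, 0.0], [6.0, 8.0]]])
proto_toy = S_toy.mean(axis=1)
j_incon = inconsistency(S_toy, proto_toy)
total = joint_loss(1.0, j_incon)
```

Comparing each of the K instances with the prototype costs K distance computations per class, instead of the K(K-1)/2 needed for all pairs.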
5 Experiments
5.1 Dataset and Evaluation Metrics
The few-shot relation classification dataset FewRel (https://thunlp.github.io/fewrel.html) was adopted in our experiments. This dataset was first generated by distant supervision and then filtered by crowdsourcing to remove noisy annotations. The final FewRel dataset consists of 100 relations, each with 700 instances. The average number of tokens in each sentence is 24.99, and there are 124,577 unique tokens in total. The 100 relations are split into 64, 16 and 20 for training, validation and test respectively.
Our experiments investigated four few-shot learning configurations, 5 way 1 shot, 5 way 5 shot, 10 way 1 shot, and 10 way 5 shot, which were the same as Han et al. (2018). According to the official evaluation scripts (https://thunlp.github.io/fewrel.html), all results given by our experiments were the mean and standard deviation values of 10 training repetitions, and were tested using 20,000 independent samples.
5.2 Training Details and Hyperparameters
All of the hyperparameters used in our experiments are listed in Table 3. The 50-dimensional GloVe word embeddings released by Pennington et al. (2014) (https://nlp.stanford.edu/projects/glove/) were adopted in the context encoder and were fixed during training. Unknown words were replaced with a unique special token UNK, whose embedding was fixed as a zero vector. A previous study (Munkhdalai and Yu, 2017) found that models trained on harder tasks may achieve better performance than those using the same configurations at both training and test stages. Therefore, we used a larger number of classes $N$ than in the test configurations to construct the train-support sets for 5-way and 10-way tasks. In our experiments, grid searches over candidate hyperparameter values were conducted to determine their optimal values. For optimization, we employed mini-batch stochastic gradient descent (SGD) with an initial learning rate of 0.1. The learning rate was decayed to one tenth every 20,000 steps. Dropout layers (Hinton et al., 2012) were also inserted before the CNN and LSTM layers, and the drop rate was set as 0.2.
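The learning-rate schedule described above can be written as a one-line function. The sketch assumes a staircase decay (the rate drops by a factor of ten at each multiple of 20,000 steps rather than continuously); the function and parameter names are hypothetical.

```python
def learning_rate(step, base_lr=0.1, decay_every=20000, factor=0.1):
    """Staircase decay: multiply the rate by `factor` every `decay_every` steps."""
    return base_lr * factor ** (step // decay_every)

# Learning rates at a few training steps.
schedule = [learning_rate(s) for s in (0, 19999, 20000, 45000)]
```

Under this reading, the rate stays at 0.1 for the first 20,000 steps, then 0.01 for the next 20,000, and so on.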
Table 3: Hyperparameters used in our experiments.
|Component|Hyperparameter|Value|
|position feature|max relative distance|40|
|unidirectional LSTM|hidden size|100|
||size of query set|5|
|Model||No.||5 Way 1 Shot||5 Way 5 Shot||10 Way 1 Shot||10 Way 5 Shot|
5.3 Comparison with Previous Work
Table 2 shows the results of different models tested on the FewRel test set. The results of the first four models, Meta Network (Munkhdalai and Yu, 2017), GNN (Garcia and Bruna, 2017), SNAIL (Mishra et al., 2018), and Prototypical Network (Snell et al., 2017), were reported by Han et al. (2018). These models were initially proposed for image classification. Han et al. (2018) just replaced their image encoding module with an instance encoding module and kept other modules unchanged. Proto-HATT (Gao et al., 2019) added a hybrid attention mechanism to prototypical networks, mainly focusing on improving the performance on few-shot relation classification with $K > 1$. From Table 2, we can see that our proposed MLMAN model outperforms all other models by a large margin, which shows the effectiveness of considering the interactions between the query instance and the support set at multiple levels.
5.4 Ablation Study
In order to evaluate the contributions of individual model components, ablation studies were conducted. Table 4 shows the performance of our model and its ablations on the development set of FewRel. Considering that the first 6 ablations only affected the few-shot learning tasks with $K > 1$, model 2 to model 7 achieved exactly the same performance as the complete model (i.e., model 1) under the 5 way 1 shot and 10 way 1 shot configurations.
5.4.1 Instance Matching and Aggregation
First, the attention-based instance aggregation introduced in Section 4.3 was replaced with max pooling (model 4) or average pooling (model 5). We can see that the model with instance-level attentive aggregation (model 1) outperformed the ones using max pooling (model 4) or average pooling (model 5) on 5-shot tasks. The differences were significant at the 1% significance level in a t-test. The advantage of attentive pooling is that the weights for integrating all support instances can be determined dynamically according to the query. For example, when conducting instance matching and aggregation between the query instance and the support set in Table 1, the weights of the 5 instances in class A were 0.03, 0.46, 0.25, 0.08 and 0.18 respectively. Instance #2 achieved the highest weight because it had the best similarity with the query instance and was considered the most helpful one when matching the query instance with class A.
Then, the effectiveness of sharing the weight parameters in Eqs. (11) and (13) was evaluated by untying them (model 3). The performance of model 3 was much worse than the complete model (model 1) as shown in Table 4, which demonstrates the need of sharing the weights for calculating matching scores at both instance and class levels.
5.4.2 Inconsistency Measurement
As introduced in Section 4.5, $J_{incon}$ is designed to measure the inconsistency among the representations of all support instances in a class. After removing $J_{incon}$, model 2 was optimized only using the objective function $J_{match}$. We can see that it performed much worse than the complete model. Furthermore, we calculated the mean of the Euclidean distances between every support instance pair in the same class using model 1 and model 2 respectively. For each support set, the calculation can be written as
$$D = \frac{2}{K(K-1)} \sum_{k=1}^{K} \sum_{k'=k+1}^{K} \left\| \hat{\mathbf{s}}_k - \hat{\mathbf{s}}_{k'} \right\|_2.$$
We sampled 20,000 support sets under the 5-way 5-shot configuration and calculated the mean of them. The resulting mean distance was smaller for model 1 than for model 2, which means that $J_{incon}$ was effective at forcing the representations of the support instances in the same class to be close to each other.
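The pairwise statistic used in this analysis can be sketched as follows: for one support set, average the Euclidean distance over all unordered pairs of support vectors. The function name and the toy vectors are assumptions.

```python
from itertools import combinations

import numpy as np

def mean_pairwise_distance(S_hat):
    """S_hat: (K, d) support vectors of one class, with K >= 2.
    Mean Euclidean distance over all K * (K - 1) / 2 unordered pairs."""
    pairs = combinations(range(len(S_hat)), 2)
    return float(np.mean([np.linalg.norm(S_hat[i] - S_hat[j]) for i, j in pairs]))

# Three toy vectors: pair distances are 5, 0 and 5.
S_toy = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
d_toy = mean_pairwise_distance(S_toy)
```

Averaging this value over many sampled support sets gives a single number per model that can be compared across ablations.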
$J_{incon}$ was further removed from model 5 to obtain model 6. It can be found that the accuracy degradation from model 5 to model 6 was larger than the one from model 1 to model 2. This implies that the objective function $J_{incon}$ also benefited from the attentive aggregation over support instances.
5.4.3 Local Matching
First, the concatenation operation in local matching was removed from model 6 in this ablation study. That is to say, instead of concatenating the representations of all support instances into one single matrix as in Eq. (3), local matching was conducted between the query instance and each support instance separately to get their vector representations (model 7). It should be noted that this led to $K$ different representations of a query instance for each support class. Then, the means over $k$ of $\{\hat{\mathbf{s}}_k\}_{k=1}^{K}$ and $\{\hat{\mathbf{q}}_k\}_{k=1}^{K}$ were calculated to get the representations of the support set $\hat{\mathbf{s}}$ and the query instance $\hat{\mathbf{q}}$. Comparing model 6 and model 7, we can see that the concatenation operation plays an important role in our model. One possible reason is that the concatenation operation can help local matching to restrain the support instances with low similarity to the query.
Second, the whole local matching module together with the concatenation and attentive aggregation operation were removed from model 6, which led to model 9. Model 9 is similar to the one proposed by Snell et al. (2017) that encoded the support and query instances independently. The difference was that model 9 was equipped with more components, including an LSTM layer, two pooling operations, and a learnable class matching function. Comparing the performance of model 6 and model 9 in Table 4, we can see that the local matching operation significantly improves the performance in few-shot relation classification. Fig. 2 shows the attention weight matrix calculated between the query instance and the support instance #2 of class A in Table 1. From this figure, we can see that the attention-based local matching is able to capture some matching relations of local contexts, such as the head entities Eva Funck and Cindy Robbins, the tail entities Gustav and Kimberly Beck, the key phrases son and daughter, the same keyword “married”, and so on.
5.4.4 Class Matching
In this experiment, we compared two class matching functions, (1) Euclidean distance (ED) (Snell et al., 2017) and (2) a learnable MLP function as shown by Eq. (13). In order to exclude the influence of the instance-level attentive aggregation, these two matching functions were compared based on model 6 and model 9. After converting the MLP function in model 6 and model 9 to Euclidean distance, model 8 and model 10 were obtained. Comparing the performance of these models in Table 4, we have two findings. (1) When local matching was adopted, the learnable MLP for class matching (model 6) outperformed the ED metric (model 8) by a large margin. (2) After removing local matching, the learnable MLP for class matching (model 9) did not perform as well as the ED metric (model 10). One possible reason is that the local matching process enhances the interaction between a query instance and a support set when calculating $\hat{\mathbf{s}}$ and $\hat{\mathbf{q}}$. Thus, a simple Euclidean distance between them may not be able to describe the complex correlation and dependency between them. On the other hand, MLP mapping is more powerful than calculating Euclidean distance, and can be more appropriate for class matching when local matching is also adopted.
6 Conclusion
In this paper, a neural network with multi-level matching and aggregation has been proposed for few-shot relation classification. First, the query and support instances are encoded interactively via local matching and aggregation. Then, the support instances in a class are further aggregated to form the class prototype and the weights are calculated by attention-based instance matching. Finally, a learnable MLP matching function is employed to calculate the class matching score between the query instance and each candidate class. Furthermore, an additional objective function is designed to improve the consistency among the vector representations of all support instances in a class. Experiments have demonstrated the effectiveness of our proposed model, which achieves state-of-the-art performance on the FewRel dataset. Studying few-shot relation classification with data generated by distant supervision and extending our MLMAN model to zero-shot learning will be the tasks of our future work.
Acknowledgments
We thank the anonymous reviewers for their valuable comments. This work was partially funded by the National Natural Science Foundation of China (Grant No. U1636201, 61871358).
References
- Bethard and Martin (2007) Steven Bethard and James H. Martin. 2007. CU-TMP: Temporal relation classification using syntactic and semantic features. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 129–132. Association for Computational Linguistics.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.
- Chen et al. (2018) Qian Chen, Zhen-Hua Ling, and Xiaodan Zhu. 2018. Enhancing sentence embedding with generalized pooling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1815–1826. Association for Computational Linguistics.
- Chen et al. (2017) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668. Association for Computational Linguistics.
- Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680. Association for Computational Linguistics.
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.
- Gao et al. (2019) Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. 2019. Hybrid attention-based prototypical networks for noisy few-shot relation classification.
- Garcia and Bruna (2017) Victor Garcia and Joan Bruna. 2017. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
- Gong et al. (2018) Jingjing Gong, Xipeng Qiu, Xinchi Chen, Dong Liang, and Xuanjing Huang. 2018. Convolutional interaction network for natural language inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1576–1585. Association for Computational Linguistics.
- Gong et al. (2017) Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference over interaction space. arXiv preprint arXiv:1709.04348.
- Han et al. (2018) Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4803–4809. Association for Computational Linguistics.
- Hinton et al. (2012) Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
- Hochreiter and Schmidhuber (1997) S Hochreiter and J Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Kim et al. (2018) Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Koch et al. (2015) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2.
- Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294. Association for Computational Linguistics.
- Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics.
- Mishra et al. (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2018. A simple neural attentive meta-learner.
- Mueller and Thyagarajan (2016) Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In AAAI, volume 16, pages 2786–2792.
- Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. 2017. Meta networks. arXiv preprint arXiv:1703.00837.
- Nair and Hinton (2010) Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.
- Ravi and Larochelle (2016) Sachin Ravi and Hugo Larochelle. 2016. Optimization as a model for few-shot learning.
- Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850.
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087.
- Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638.
- Wang et al. (2016) Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1298–1307. Association for Computational Linguistics.
- Xiong et al. (2018) Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2018. One-shot relational learning for knowledge graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1980–1990. Association for Computational Linguistics.
- Zelenko et al. (2002) Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2002. Kernel methods for relation extraction. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
- Zeng et al. (2014) Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344. Dublin City University and Association for Computational Linguistics.
- Zhou et al. (2016) Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212. Association for Computational Linguistics.
- Zhou et al. (2018) Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1118–1127. Association for Computational Linguistics.