Multi-Level Matching and Aggregation Network for Few-Shot Relation Classification

06/16/2019, by Zhi-Xiu Ye et al. (USTC)

This paper presents a multi-level matching and aggregation network (MLMAN) for few-shot relation classification. Previous studies on this topic adopt prototypical networks, which calculate the embedding vector of a query instance and the prototype vector of each support set independently. In contrast, our proposed MLMAN model encodes the query instance and each support set in an interactive way by considering their matching information at both local and instance levels. The final class prototype for each support set is obtained by attentive aggregation over the representations of its support instances, where the weights are calculated using the query instance. Experimental results demonstrate the effectiveness of our proposed methods, which achieve a new state-of-the-art performance on the FewRel dataset.

1 Introduction

Relation classification (RC) is a fundamental task in natural language processing (NLP), which aims to identify the semantic relation between two entities in text. For example, the instance “[London] is the capital of [the UK]” expresses the relation capital_of between the two entities London and the UK.

Support Set
class A: mother
instance #1 The Queen Consort [Jetsun Pema] gave birth to a son on 5 February 2016 , [Jigme Namgyel Wangchuck].
instance #2 He married the American actress [Cindy Robbins] and was stepfather to her daughter , [Kimberly Beck].
instance #3 Edgar married actress [Moyna Macgill] and became the father of [Angela Lansbury].
instance #4 In 1845 , [Cemile Sultan] ’s mother , Empress [Düzdidil Kadın], died.
instance #5 Bo ’s wife [Gu Kailai] traveled with their son [Bo Guagua] to Britain.
class B: member_of
class C: father
class D: sport
class E: voice_type
Query Instance
He was married to [Eva Funck] and they have a son [Gustav] .
Table 1: A data example of 5-way-5-shot relation classification in the FewRel development set. The correct relation class for the query instance is class A: mother. The instances of the other relation classes are omitted to save space.

Some conventional relation classification methods (Bethard and Martin, 2007; Zelenko et al., 2002) adopted supervised training and suffered from the lack of large-scale manually labeled data. To address this issue, the distant supervision method (Mintz et al., 2009) was proposed, which annotated training data by heuristically aligning knowledge bases (KBs) and texts. However, the long-tail problem in KBs (Xiong et al., 2018; Han et al., 2018) still exists and makes it hard to classify the relations with very few training samples.

This paper focuses on the few-shot relation classification task, which is designed to address the long-tail problem. In this task, only a few (e.g., 1 or 5) support instances are given for each relation, as shown by the example in Table 1.

The few-shot learning problem has been studied extensively in the computer vision (CV) field. Some methods adopt meta-learning architectures (Santoro et al., 2016; Ravi and Larochelle, 2016; Finn et al., 2017; Munkhdalai and Yu, 2017), which learn fast-learning abilities from previous experiences (e.g., the training set) and then rapidly generalize to new concepts (e.g., the test set). Other methods use metric-learning-based networks (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017), which learn the distance distributions among classes. A simple and effective metric-based few-shot learning method is the prototypical network (Snell et al., 2017). In a prototypical network, query and support instances are encoded into an embedding space independently. Then, a prototype vector for each candidate class is derived as the mean of its support instances in the embedding space. Finally, classification is performed by calculating the distances between the embedding vector of the query and all class prototypes. The prototypical network has also been applied to few-shot relation classification recently (Han et al., 2018).
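For concreteness, the following is a minimal sketch (ours, in PyTorch, not the authors' code) of how a prototypical network scores a query: support embeddings are averaged per class, and the query is scored by its negative squared Euclidean distance to each class prototype.

```python
import torch

def prototypical_scores(support_emb: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
    """Score a query against N classes with a prototypical network.

    support_emb: [N, K, D] embeddings of K support instances for each of N classes.
    query_emb:   [D] embedding of a single query instance.
    Returns:     [N] scores (negative squared Euclidean distance to each prototype).
    """
    prototypes = support_emb.mean(dim=1)                               # [N, D], prototype = mean of supports
    dists = ((prototypes - query_emb.unsqueeze(0)) ** 2).sum(dim=-1)   # [N] squared distances
    return -dists                                                      # larger score = closer prototype

# toy usage: 5-way 5-shot with 64-dimensional embeddings
scores = prototypical_scores(torch.randn(5, 5, 64), torch.randn(64))
pred_class = scores.argmax().item()
```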

This paper proposes a multi-level matching and aggregation network (MLMAN) for few-shot relation classification. Different from prototypical networks, which represent support sets without any dependency on query instances, our proposed MLMAN model encodes each query instance and each support set in an interactive way by considering their matching information at both the local and instance levels. At the local level, the local context representations of a query instance and a support set are softly matched toward each other following the sentence matching framework of Chen et al. (2017). Then, the matched local representations are aggregated into an embedding vector for the query and each support instance using max and average pooling. At the instance level, the matching degree between the query instance and each of the support instances is calculated via a multi-layer perceptron (MLP). Taking the matching degrees as weights, the instances in a support set are aggregated to form the class prototype for final classification. All the matching and aggregation layers in the MLMAN model are estimated jointly using training data. Since the representations of the support instances in each class are expected to be close to each other, an auxiliary loss function is further designed to measure the inconsistency among the support representations in each class.

In summary, our contributions in this paper are three-fold. First, a multi-level matching and aggregation network is proposed to encode query instances and class prototypes in an interactive fashion. Second, an auxiliary loss function measuring the consistency among support instances is designed. Third, our method achieves a new state-of-the-art performance on FewRel, a public few-shot relation classification dataset.

2 Related Work

2.1 Relation Classification

Relation classification aims to identify the semantic relation between two entities in a sentence. In recent years, neural networks have been widely applied to this task.

Zeng et al. (2014) employed position features and convolutional neural networks (CNNs) to capture structural and contextual information respectively, followed by a max pooling operation to select the most useful features. Wang et al. (2016) proposed multi-level attention CNNs, which capture both entity-specific attention and relation-specific pooling attention in order to better discern patterns in heterogeneous contexts. Zhou et al. (2016) proposed attention-based bidirectional long short-term memory networks (AttBLSTMs) to capture the most important semantic information in a sentence. All of these methods require a large amount of training data and cannot quickly adapt to a new class that has never been seen.

2.2 Metric Based Few-Shot Learning

In the few-shot learning paradigm, a classifier is required to generalize to new classes with only a small number of training samples. Metric-based approaches learn a set of projection functions that take support and query samples from the target problem and classify them in a feed-forward manner. This approach has lower complexity and is easier to implement than meta-learner-based approaches (Ravi and Larochelle, 2016; Finn et al., 2017; Santoro et al., 2016; Munkhdalai and Yu, 2017).

Several metric-based few-shot learning methods have been developed for computer vision (CV) tasks, and all of them encode each support or query image into a vector independently for classification. Koch et al. (2015) proposed siamese neural networks, which employ a shared structure to encode both support and query samples and an additional layer that computes the induced distance metric between the pair. Vinyals et al. (2016) proposed to learn a matching network augmented with attention and external memories. They also proposed an episode-based training procedure, based on the principle that test and training conditions must match, which has been adopted by many subsequent studies. Snell et al. (2017) proposed prototypical networks, which learn a metric space in which classification can be performed by computing distances to the prototype representations of all classes, where the prototype of each class is the mean of its support samples. Garcia and Bruna (2017) defined a graph neural network architecture to assimilate generic message-passing inference algorithms, which generalizes the three models above.

Regarding few-shot relation classification, Han et al. (2018) adopted prototypical networks to build baseline models on the FewRel dataset. Gao et al. (2019) proposed hybrid attention-based prototypical networks to handle noisy training samples in few-shot learning. In this paper, we improve conventional prototypical networks for few-shot relation classification by encoding the query instance and the class prototype interactively through multi-level matching and aggregation.

2.3 Sentence Matching

Sentence matching is essential for many NLP tasks, such as natural language inference (NLI) (Bowman et al., 2015) and response selection (Lowe et al., 2015). Some sentence matching methods rely mainly on sentence encoding (Mueller and Thyagarajan, 2016; Conneau et al., 2017; Chen et al., 2018): they encode a pair of sentences independently and then feed their embeddings into a classifier, such as a neural network, to decide the relationship between them. Other methods are based on joint models (Chen et al., 2017; Gong et al., 2017; Kim et al., 2018), which use cross-features to represent the local (i.e., word-level and phrase-level) alignments for better performance. In this paper, we follow the joint models to achieve local matching between a query instance and the support set of a class. The difference between our task and the sentence matching tasks mentioned above is that our goal is to match a sentence against a set of sentences, instead of another sentence (Bowman et al., 2015) or a sequence of sentences (Lowe et al., 2015).

3 Task Definition

In few-shot relation classification, we are given two datasets, $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$. Each dataset consists of a set of samples $(x, p, r)$, where $x = \{w_1, w_2, \dots, w_T\}$ is a sentence composed of $T$ words and the $t$-th word is $w_t$, $p = (p_1, p_2)$ indicates the positions of the two entities, and $r$ is the relation label of the instance. These two datasets have relation label spaces that are disjoint with each other. Under the few-shot configuration, $\mathcal{D}_{test}$ is split into two parts, a support set $\mathcal{S}$ and a query set $\mathcal{Q}$. If $\mathcal{S}$ contains $K$ labeled samples for each of $N$ relation classes, the target few-shot problem is named $N$-way-$K$-shot. $\mathcal{Q}$ contains test samples, each labeled with one of the $N$ classes. Assuming that we only have $\mathcal{S}$ and $\mathcal{Q}$, we could train a model using $\mathcal{S}$ and evaluate its performance on $\mathcal{Q}$. But limited by the number of support samples (i.e., $N \times K$), it is hard to train a good model from scratch.

Although $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ have disjoint relation label spaces, $\mathcal{D}_{train}$ can still be utilized to help few-shot relation classification on $\mathcal{D}_{test}$. One approach is the paradigm proposed by Vinyals et al. (2016), which obeys an important machine learning principle: test and train conditions must match. That is to say, we also split $\mathcal{D}_{train}$ into a train-support part and a train-query part, and mimic the few-shot learning setting at the training stage. In each training iteration, $N$ classes are randomly selected from $\mathcal{D}_{train}$, and $K$ support instances are randomly selected from each class. In this way, we construct the train-support set $\mathcal{S} = \{s_k^n; k = 1, \dots, K, n = 1, \dots, N\}$, where $s_k^n$ is the $k$-th instance of class $n$. We then randomly select $R$ samples from the remaining samples of those $N$ classes and construct the train-query set $\mathcal{Q} = \{(q_r, l_r); r = 1, \dots, R\}$, where $l_r$ is the label of instance $q_r$.
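As an illustration of this episode construction, the following is a minimal sketch (ours) that samples one $N$-way-$K$-shot training episode; the data layout (a dict mapping relation names to instance lists) and the function name are assumptions for illustration, not the released FewRel tooling.

```python
import random

def sample_episode(data_by_relation, n_way=5, k_shot=5, n_query=5):
    """Sample one N-way K-shot episode.

    data_by_relation: dict mapping a relation name to a list of its instances.
    Returns (support, query): support is {relation: [K instances]},
    query is a list of (instance, relation) pairs drawn from the same N relations.
    """
    relations = random.sample(list(data_by_relation.keys()), n_way)
    support, query = {}, []
    for rel in relations:
        sampled = random.sample(data_by_relation[rel], k_shot + n_query)
        support[rel] = sampled[:k_shot]
        query.extend((inst, rel) for inst in sampled[k_shot:])
    return support, query
```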

Figure 1: The framework of our proposed MLMAN model.

Just like conventional prototypical networks, we minimize the following objective function at training time,

$$\mathcal{L} = -\frac{1}{R} \sum_{r=1}^{R} \log p(l_r \mid q_r, \mathcal{S}), \qquad (1)$$

where $p(l_r \mid q_r, \mathcal{S})$ is defined as

$$p(l_r \mid q_r, \mathcal{S}) = \frac{\exp\big(f(q_r, \mathcal{S}^{l_r})\big)}{\sum_{n=1}^{N} \exp\big(f(q_r, \mathcal{S}^{n})\big)}. \qquad (2)$$

The function $f(q, \mathcal{S}^n)$ calculates the matching degree between the query instance $q$ and the set of support instances $\mathcal{S}^n = \{s_k^n; k = 1, \dots, K\}$. How to design this function is the focus of this paper.
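Since Eq. (2) is a softmax over the class matching scores, Eqs. (1)–(2) reduce to a standard cross-entropy loss over those scores. A minimal PyTorch sketch (ours):

```python
import torch
import torch.nn.functional as F

def episode_loss(match_scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Eqs. (1)-(2): negative log-likelihood of the correct class.

    match_scores: [R, N] matching degrees f(q_r, S^n) for R queries and N classes.
    labels:       [R] index of the correct class for each query.
    """
    # cross_entropy applies the softmax of Eq. (2) and averages the negative log-probs of Eq. (1)
    return F.cross_entropy(match_scores, labels)

loss = episode_loss(torch.randn(5, 5, requires_grad=True), torch.tensor([0, 1, 2, 3, 4]))
loss.backward()
```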

4 Methodology

In this section, we introduce our proposed multi-level matching and aggregation network (MLMAN) for modeling $f(q, \mathcal{S}^n)$. For simplicity, we drop the class superscript $n$ from Section 4.1 to Section 4.4. The framework of our proposed MLMAN model is shown in Fig. 1, which has four main modules.

  • Context Encoder. Given a sentence and the positions of two entities within this sentence, CNNs (Zeng et al., 2014) are adopted to derive the local context representations of each word in the sentence.

  • Local Matching and Aggregation. Similar to (Chen et al., 2017), given the local representation of a query instance and the local representations of support instances, the attention method is employed to collect local matching information between them. Then, the matched local representations are aggregated to represent each instance as an embedding vector.

  • Instance Matching and Aggregation. The matching degree between the query instance and each of the support instances is calculated using an MLP. Then, we take the matching degrees as weights to sum the representations of the support instances in order to get the class prototype.

  • Class Matching. An MLP is built to calculate the matching score between the representations of the query instance and the class prototype.

More details of these four modules will be introduced in the following subsections.

4.1 Context Encoder

For a query or support instance, each word in the sentence is first mapped into a $d_w$-dimensional word embedding (Pennington et al., 2014). In order to describe the position information of the two entities in this instance, the position features (PFs) proposed by Zeng et al. (2014) are also adopted in our work. Here, the PFs describe the relative distances between the current word and the two entities, and are further mapped into two vectors of $d_p$ dimensions each. Finally, these three vectors are concatenated to get the word representation of $d_w + 2d_p$ dimensions, and the instance can be written as a matrix $\mathbf{X} \in \mathbb{R}^{T \times (d_w + 2d_p)}$.

The most popular models for local context encoding are recurrent neural networks (RNNs) with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) and convolutional neural networks (CNNs) (Kim, 2014). In this paper, we employ CNNs to build the context encoder. The input instance $\mathbf{X}$ is fed into a CNN with $d_c$ filters, whose output is a matrix of $T \times d_c$ dimensions. In this way, the context representations $\mathbf{Q} \in \mathbb{R}^{T_q \times d_c}$ of the query instance and the context representations $\{\mathbf{S}_k \in \mathbb{R}^{T_k \times d_c}\}_{k=1}^{K}$ of the $K$ support instances are obtained, where $T_q$ and $T_k$ are the lengths of the query sentence and the $k$-th support sentence respectively.
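A minimal sketch (ours, in PyTorch) of such a context encoder under the hyper-parameters in Table 3 (50-dim word embeddings, 5-dim position features, window-3 CNN with 200 filters); the module name, clipping scheme, and the ReLU after the convolution are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Word embedding + two position-feature embeddings, followed by a 1-D CNN."""
    def __init__(self, vocab_size, d_word=50, d_pos=5, max_dist=40, n_filters=200, window=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_word)
        # relative distances are assumed clipped to [-max_dist, max_dist] and shifted to be non-negative
        self.pos1_emb = nn.Embedding(2 * max_dist + 1, d_pos)
        self.pos2_emb = nn.Embedding(2 * max_dist + 1, d_pos)
        self.conv = nn.Conv1d(d_word + 2 * d_pos, n_filters, window, padding=window // 2)

    def forward(self, words, pos1, pos2):
        # words, pos1, pos2: [B, T] integer indices
        x = torch.cat([self.word_emb(words), self.pos1_emb(pos1), self.pos2_emb(pos2)], dim=-1)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)   # [B, T, n_filters]
        return torch.relu(x)

enc = ContextEncoder(vocab_size=10000)
out = enc(torch.randint(0, 10000, (2, 30)), torch.randint(0, 81, (2, 30)), torch.randint(0, 81, (2, 30)))
```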

4.2 Local Matching and Aggregation

In order to get the matching information between $\mathbf{Q}$ and $\{\mathbf{S}_k\}_{k=1}^{K}$, we first concatenate the support instance representations into one matrix as follows,

$$\mathbf{C} = [\mathbf{S}_1; \mathbf{S}_2; \dots; \mathbf{S}_K], \qquad (3)$$

where $\mathbf{C} \in \mathbb{R}^{T_c \times d_c}$ with $T_c = \sum_{k=1}^{K} T_k$. Then, we collect the matching information between $\mathbf{Q}$ and $\mathbf{C}$ and calculate their matched representations $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{C}}$ as follows,

$$e_{ij} = \mathbf{q}_i^\top \mathbf{c}_j, \qquad (4)$$

$$\tilde{\mathbf{q}}_i = \sum_{j=1}^{T_c} \frac{\exp(e_{ij})}{\sum_{j'=1}^{T_c} \exp(e_{ij'})} \mathbf{c}_j, \qquad (5)$$

$$\tilde{\mathbf{c}}_j = \sum_{i=1}^{T_q} \frac{\exp(e_{ij})}{\sum_{i'=1}^{T_q} \exp(e_{i'j})} \mathbf{q}_i, \qquad (6)$$

where $\mathbf{q}_i$ and $\tilde{\mathbf{q}}_i$ are the $i$-th rows of $\mathbf{Q}$ and $\tilde{\mathbf{Q}}$ respectively, and $\mathbf{c}_j$ and $\tilde{\mathbf{c}}_j$ are the $j$-th rows of $\mathbf{C}$ and $\tilde{\mathbf{C}}$ respectively.
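A compact sketch (ours) of this soft alignment, written in matrix form following the ESIM-style matching of Chen et al. (2017); the tensor names are illustrative.

```python
import torch

def soft_align(Q: torch.Tensor, C: torch.Tensor):
    """Eqs. (4)-(6): align query and concatenated-support context representations.

    Q: [Tq, D] local representations of the query instance.
    C: [Tc, D] concatenated local representations of the K support instances (Eq. (3)).
    Returns (Q_tilde, C_tilde), the matched counterparts of Q and C.
    """
    e = Q @ C.t()                               # [Tq, Tc] alignment scores, Eq. (4)
    Q_tilde = torch.softmax(e, dim=1) @ C       # each query position as a mixture of support positions, Eq. (5)
    C_tilde = torch.softmax(e, dim=0).t() @ Q   # each support position as a mixture of query positions, Eq. (6)
    return Q_tilde, C_tilde

Q_t, C_t = soft_align(torch.randn(25, 200), torch.randn(120, 200))
```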

Next, the original representations and the matched representations are fused utilizing a ReLU layer as follows,

$$\bar{\mathbf{q}}_i = \mathrm{ReLU}\big([\mathbf{q}_i; \tilde{\mathbf{q}}_i; \mathbf{q}_i - \tilde{\mathbf{q}}_i; \mathbf{q}_i \odot \tilde{\mathbf{q}}_i] \mathbf{W}_1\big), \qquad (7)$$

$$\bar{\mathbf{c}}_j = \mathrm{ReLU}\big([\mathbf{c}_j; \tilde{\mathbf{c}}_j; \mathbf{c}_j - \tilde{\mathbf{c}}_j; \mathbf{c}_j \odot \tilde{\mathbf{c}}_j] \mathbf{W}_1\big), \qquad (8)$$

where $\odot$ is the element-wise product and $\mathbf{W}_1$ is the weight matrix at this layer for reducing dimensionality. $\bar{\mathbf{C}}$ is further split into $K$ representations $\{\bar{\mathbf{C}}_k\}_{k=1}^{K}$ corresponding to the $K$ support instances. Both $\bar{\mathbf{Q}}$ and each $\bar{\mathbf{C}}_k$ are fed into a single-layer bi-directional LSTM (BLSTM) with $d_h$ hidden units along each direction to obtain the final local matching results $\hat{\mathbf{Q}}$ and $\hat{\mathbf{S}}_k$.

Local aggregation aims to convert the results of local matching into a single vector for the query and each support instance. In this paper, we employ a max pooling together with an average pooling, and concatenate their results into one vector $\hat{\mathbf{q}}$ or $\hat{\mathbf{s}}_k$. The calculations are as follows,

$$\hat{\mathbf{q}} = \big[\max(\hat{\mathbf{Q}}); \mathrm{ave}(\hat{\mathbf{Q}})\big], \qquad (9)$$

$$\hat{\mathbf{s}}_k = \big[\max(\hat{\mathbf{S}}_k); \mathrm{ave}(\hat{\mathbf{S}}_k)\big], \qquad (10)$$

where the pooling operations are performed over word positions and $\hat{\mathbf{q}}, \hat{\mathbf{s}}_k \in \mathbb{R}^{4 d_h}$.
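A minimal sketch (ours) of the fusion, BLSTM, and pooling steps of Eqs. (7)–(10). The default sizes follow Table 3 (200 CNN filters, 100 LSTM hidden units per direction); the exact fusion features, projection placement, and module names are our reading of the text, not released code.

```python
import torch
import torch.nn as nn

class LocalFusionAggregator(nn.Module):
    """Eqs. (7)-(10): fuse original and matched representations, run a BLSTM, then pool."""
    def __init__(self, d_in=200, d_hidden=100):
        super().__init__()
        self.proj = nn.Linear(4 * d_in, d_hidden)            # dimensionality-reducing fusion layer
        self.blstm = nn.LSTM(d_hidden, d_hidden, batch_first=True, bidirectional=True)

    def fuse(self, X, X_tilde):
        # Eq. (7)/(8): [x; x~; x - x~; x * x~] followed by ReLU(... W)
        feats = torch.cat([X, X_tilde, X - X_tilde, X * X_tilde], dim=-1)
        return torch.relu(self.proj(feats))

    def aggregate(self, X, X_tilde):
        fused = self.fuse(X, X_tilde).unsqueeze(0)            # [1, T, d_hidden]
        H, _ = self.blstm(fused)                              # [1, T, 2*d_hidden]
        H = H.squeeze(0)
        # Eq. (9)/(10): concatenate max pooling and average pooling over word positions
        return torch.cat([H.max(dim=0).values, H.mean(dim=0)], dim=-1)   # [4*d_hidden]

agg = LocalFusionAggregator()
vec = agg.aggregate(torch.randn(30, 200), torch.randn(30, 200))   # 400-dimensional instance vector
```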

4.3 Instance Matching and Aggregation

Similar to conventional prototypical networks (Snell et al., 2017), our proposed method calculates the class prototype from the representations of all support instances in the class, i.e., $\{\hat{\mathbf{s}}_k\}_{k=1}^{K}$. However, instead of using a naive mean operation, we aggregate the instance-level representations via attention over $\{\hat{\mathbf{s}}_k\}_{k=1}^{K}$, where each weight is derived from the instance matching score between $\hat{\mathbf{q}}$ and $\hat{\mathbf{s}}_k$. The matching function is as follows,

$$m_k = \mathbf{v}^\top \mathrm{ReLU}\big(\mathbf{W}_2 [\hat{\mathbf{q}}; \hat{\mathbf{s}}_k]\big), \qquad (11)$$

where $\mathbf{W}_2$ and $\mathbf{v}$ are learnable parameters. $m_k$ describes the instance-level matching degree between the query instance $\hat{\mathbf{q}}$ and the support instance $\hat{\mathbf{s}}_k$. Then, all $\hat{\mathbf{s}}_k$ are aggregated into one vector as

$$\hat{\mathbf{s}} = \sum_{k=1}^{K} \frac{\exp(m_k)}{\sum_{k'=1}^{K} \exp(m_{k'})} \hat{\mathbf{s}}_k, \qquad (12)$$

and $\hat{\mathbf{s}}$ is the class prototype.
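A minimal sketch (ours) of this instance-level matching and attentive aggregation; the one-hidden-layer MLP with ReLU is an assumption consistent with the text, and the 400-dimensional instance vectors follow the earlier sketches.

```python
import torch
import torch.nn as nn

class InstanceAggregator(nn.Module):
    """Eq. (11): MLP matching scores; Eq. (12): attention-weighted class prototype."""
    def __init__(self, d_inst=400, d_hidden=200):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d_inst, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))

    def forward(self, q_hat, S_hat):
        # q_hat: [d_inst] query vector; S_hat: [K, d_inst] support-instance vectors
        pairs = torch.cat([q_hat.expand_as(S_hat), S_hat], dim=-1)   # [K, 2*d_inst]
        scores = self.mlp(pairs).squeeze(-1)                         # [K] matching degrees m_k
        weights = torch.softmax(scores, dim=0)                       # attention weights of Eq. (12)
        return weights @ S_hat, scores                               # prototype [d_inst], raw scores

aggregator = InstanceAggregator()
prototype, scores = aggregator(torch.randn(400), torch.randn(5, 400))
```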

4.4 Class Matching

After the class prototype $\hat{\mathbf{s}}$ and the embedding vector $\hat{\mathbf{q}}$ of the query instance have been determined, the class-level matching function in Eq. (2) is defined as

$$f(q, \mathcal{S}) = \mathbf{v}^\top \mathrm{ReLU}\big(\mathbf{W}_2 [\hat{\mathbf{q}}; \hat{\mathbf{s}}]\big). \qquad (13)$$

Eqs. (11) and (13) have the same form. In our experiments, sharing the weights $\mathbf{W}_2$ and $\mathbf{v}$ in these two equations, i.e., employing exactly the same function for both instance-level and class-level matching, leads to better performance.
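To make the weight sharing concrete, here is a small self-contained sketch (ours; the MLP form and sizes follow the earlier sketches and are assumptions) that applies one shared matching MLP at both the instance level (Eqs. (11)–(12)) and the class level (Eq. (13)).

```python
import torch
import torch.nn as nn

# one shared matching MLP used for both Eq. (11) and Eq. (13)
matcher = nn.Sequential(nn.Linear(800, 200), nn.ReLU(), nn.Linear(200, 1))

q_hat = torch.randn(400)                     # aggregated query representation
S_hat = torch.randn(5, 400)                  # aggregated representations of K = 5 support instances

# Eqs. (11)-(12): instance-level matching and attentive aggregation
inst_scores = matcher(torch.cat([q_hat.expand_as(S_hat), S_hat], dim=-1)).squeeze(-1)
prototype = torch.softmax(inst_scores, dim=0) @ S_hat

# Eq. (13): class-level matching with exactly the same weights
class_score = matcher(torch.cat([q_hat, prototype], dim=-1))
```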

4.5 Joint Training with Inconsistency Measurement

If the representations of the support instances in a class are far away from each other, it becomes difficult for the derived class prototype to capture the common characteristics of all support instances. Therefore, a function which measures the inconsistency among the set of support instances is designed. In order to avoid the high complexity of directly comparing every two support instances in a class, we calculate the inconsistency measurement as the average Euclidean distance between the support instances and the class prototype,

$$\mathcal{L}_{incon} = \frac{1}{NK} \sum_{n=1}^{N} \sum_{k=1}^{K} \big\| \hat{\mathbf{s}}_k^n - \hat{\mathbf{s}}^n \big\|_2, \qquad (14)$$

where $n$ is the class index and $\|\cdot\|_2$ calculates the 2-norm of a vector.

By combining Eqs. (1) and (14), the final objective function for training the whole model is defined as

$$\mathcal{L}_{final} = \mathcal{L} + \lambda \mathcal{L}_{incon}, \qquad (15)$$

where $\lambda$ is a hyper-parameter and was set to 1 in our experiments without any tuning.
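A minimal sketch (ours) of the inconsistency term of Eq. (14) and the joint objective of Eq. (15); the reduction over classes is our reading of the text.

```python
import torch

def inconsistency_loss(support_vecs: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Eq. (14): average Euclidean distance between support vectors and their class prototype.

    support_vecs: [N, K, D] aggregated support representations per class.
    prototypes:   [N, D] class prototypes.
    """
    dists = (support_vecs - prototypes.unsqueeze(1)).norm(dim=-1)   # [N, K]
    return dists.mean()

def joint_loss(ce_loss: torch.Tensor, incon_loss: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Eq. (15): cross-entropy objective plus lambda-weighted inconsistency term (lambda = 1)."""
    return ce_loss + lam * incon_loss
```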

Model 5 Way 1 Shot 5 Way 5 Shot 10 Way 1 Shot 10 Way 5 Shot
Meta Network Han et al. (2018)
GNN Han et al. (2018)
SNAIL Han et al. (2018)
Prototypical Network Han et al. (2018)
Proto-HATT Gao et al. (2019) - - - -
MLMAN
Table 2: Accuracies (%) of different models on FewRel test set.

5 Experiments

5.1 Dataset and Evaluation Metrics

The few-shot relation classification dataset FewRel (https://thunlp.github.io/fewrel.html) was adopted in our experiments. This dataset was first generated by distant supervision and then filtered by crowdsourcing to remove noisy annotations. The final FewRel dataset consists of 100 relations, each with 700 instances. The average number of tokens per sentence is 24.99, and there are 124,577 unique tokens in total. The 100 relations are split into 64, 16, and 20 relations for training, validation, and test respectively.

Our experiments investigated four few-shot learning configurations, 5-way 1-shot, 5-way 5-shot, 10-way 1-shot, and 10-way 5-shot, which are the same as in Han et al. (2018). Following the official evaluation scripts (https://thunlp.github.io/fewrel.html), all results given by our experiments are the mean and standard deviation values over 10 training repetitions, and each model was tested using 20,000 independent test samples.

5.2 Training Details and Hyperparameters

All of the hyperparameters used in our experiments are listed in Table 3. The 50-dimensional GloVe word embeddings released by Pennington et al. (2014) (https://nlp.stanford.edu/projects/glove/) were adopted in the context encoder and were fixed during training. Unknown words were replaced with a unique special token UNK, whose embedding was fixed as a zero vector. A previous study (Munkhdalai and Yu, 2017) found that models trained on harder tasks may achieve better performance than those using the same configuration at both the training and test stages. Therefore, we set N = 20 when constructing the train-support sets for both the 5-way and 10-way tasks. In our experiments, grid searches were conducted to determine the optimal values of the remaining hyperparameters. For optimization, we employed mini-batch stochastic gradient descent (SGD) with an initial learning rate of 0.1. The learning rate was decayed to one tenth every 20,000 steps. In addition, dropout layers (Hinton et al., 2012) were inserted before the CNN and LSTM layers, with the dropout rate set to 0.2.
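The optimization setup described above (SGD, initial learning rate 0.1, decayed to one tenth every 20,000 steps, dropout 0.2) could be configured in PyTorch roughly as follows; `model` here is only a placeholder standing in for the full MLMAN network.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10))   # placeholder for the full MLMAN model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# decay the learning rate to one tenth every 20,000 training steps
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20000, gamma=0.1)
dropout = nn.Dropout(p=0.2)                # applied before the CNN and LSTM layers in the real model

for step in range(3):                      # training loop skeleton; real training runs many more steps
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()   # stand-in for the loss of Eq. (15)
    loss.backward()
    optimizer.step()
    scheduler.step()
```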

Component Parameter Value
word embedding dimension 50
position feature max relative distance 40
dimension 5
CNN window size 3
filter number 200
dropout dropout rate 0.2
unidirectional LSTM hidden size 100
optimization strategy SGD
learning rate 0.1
training size of query set (R) 5
number of classes (N) at training 20
weight of inconsistency loss (λ) 1
Table 3: Hyper-parameters of the models built in our experiments.
Model No. 5 Way 1 Shot 5 Way 5 Shot 10 Way 1 Shot 10 Way 5 Shot
MLMAN 1
 - $\mathcal{L}_{incon}$ 2
 IM (shared → untied) 3
 IA (att. → max.) 4
 IA (att. → ave.) 5
  - $\mathcal{L}_{incon}$ 6
   LM (- concatenation) 7
   CM (MLP → ED) 8
   - LM 9
    CM (MLP → ED) 10
Table 4: Accuracies (%) of different models on the FewRel development set. Here, IM stands for instance matching, IA for instance aggregation, LM for local matching, CM for class matching, MLP for multi-layer perceptron, and ED for Euclidean distance.

5.3 Comparison with Previous Work

Table 2 shows the results of different models on the FewRel test set. The results of the first four models, Meta Network (Munkhdalai and Yu, 2017), GNN (Garcia and Bruna, 2017), SNAIL (Mishra et al., 2018), and Prototypical Network (Snell et al., 2017), were reported by Han et al. (2018). These models were initially proposed for image classification; Han et al. (2018) simply replaced their image encoding modules with an instance encoding module and kept the other modules unchanged. Proto-HATT (Gao et al., 2019) added a hybrid attention mechanism to prototypical networks, mainly focusing on improving the performance of few-shot relation classification with more than one shot (i.e., K > 1). From Table 2, we can see that our proposed MLMAN model outperforms all other models by a large margin, which shows the effectiveness of considering the interactions between the query instance and the support set at multiple levels.

5.4 Ablation Study

In order to evaluate the contributions of individual model components, ablation studies were conducted. Table 4 shows the performance of our model and its ablations on the FewRel development set. Considering that the first six ablations only affect the few-shot learning tasks with K > 1, models 2 to 7 achieved exactly the same performance as the complete model (i.e., model 1) under the 5-way 1-shot and 10-way 1-shot configurations.

5.4.1 Instance Matching and Aggregation

First, the attention-based instance aggregation introduced in Section 4.3 was replaced with max pooling (model 4) or average pooling (model 5). We can see that the model with instance-level attentive aggregation (model 1) outperformed the ones using max pooling (model 4) or average pooling (model 5) on the 5-shot tasks. The differences were significant at the 1% level in a t-test. The advantage of attentive pooling is that the weights for integrating the support instances can be determined dynamically according to the query. For example, when conducting instance matching and aggregation between the query instance and the support set in Table 1, the weights of the 5 instances in class A were 0.03, 0.46, 0.25, 0.08 and 0.18 respectively. Instance #2 received the highest weight because it had the highest similarity to the query instance and was considered the most helpful one when matching the query instance with class A.

Then, the effectiveness of sharing the weight parameters in Eqs. (11) and (13) was evaluated by untying them (model 3). The performance of model 3 was much worse than that of the complete model (model 1), as shown in Table 4, which demonstrates the benefit of sharing the weights for calculating matching scores at both the instance and class levels.

5.4.2 Inconsistency Measurement

As introduced in Section 4.5, $\mathcal{L}_{incon}$ is designed to measure the inconsistency among the representations of the support instances in a class. After removing $\mathcal{L}_{incon}$, model 2 was optimized only using the objective function $\mathcal{L}$ in Eq. (1). We can see that it performed much worse than the complete model. Furthermore, we calculated the mean of the Euclidean distances between every pair of support instances in the same class using model 1 and model 2 respectively. For each support set, the calculation can be written as

$$d = \frac{2}{K(K-1)} \sum_{k=1}^{K-1} \sum_{k'=k+1}^{K} \big\| \hat{\mathbf{s}}_k - \hat{\mathbf{s}}_{k'} \big\|_2. \qquad (16)$$

We sampled 20,000 support sets under the 5-way 5-shot configuration and calculated the mean of these distances. The mean distance given by model 1 was much smaller than that given by model 2, which means that $\mathcal{L}_{incon}$ was effective at forcing the representations of the support instances in the same class to be close to each other.
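Eq. (16) averages the pairwise Euclidean distances within one support set; a minimal sketch (ours) of that check:

```python
import torch

def mean_pairwise_distance(S: torch.Tensor) -> torch.Tensor:
    """Eq. (16): mean Euclidean distance over all pairs of support representations.

    S: [K, D] aggregated representations of the K support instances of one class.
    """
    dists = torch.cdist(S, S)                      # [K, K] pairwise Euclidean distances
    k = S.size(0)
    return dists.sum() / (k * (k - 1))             # zero diagonal excluded; average over ordered pairs

print(mean_pairwise_distance(torch.randn(5, 64)))
```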

$\mathcal{L}_{incon}$ was further removed from model 5, which gave model 6. It can be found that the accuracy degradation from model 5 to model 6 was larger than that from model 1 to model 2. This implies that the objective function $\mathcal{L}_{incon}$ also benefited from the attentive aggregation over support instances.

5.4.3 Local Matching

First, the concatenation operation in local matching was removed from model 6 in this ablation study. That is to say, instead of concatenating the representations of all support instances into one single matrix as in Eq. (3), local matching was conducted between the query instance and each support instance separately to get their vector representations (model 7). It should be noticed that this led to a different representation of the query instance for each support instance. Then, the means over the $K$ resulting vectors were calculated to get the representations of the support set and the query instance. Comparing model 6 and model 7, we can see that the concatenation operation plays an important role in our model. One possible reason is that the concatenation operation helps local matching to restrain the support instances with low similarity to the query.

Second, the whole local matching module, together with the concatenation and attentive aggregation operations, was removed from model 6, which led to model 9. Model 9 is similar to the model proposed by Snell et al. (2017), which encodes the support and query instances independently. The difference is that model 9 is equipped with more components, including an LSTM layer, two pooling operations, and a learnable class matching function. Comparing the performance of model 6 and model 9 in Table 4, we can see that the local matching operation significantly improves the performance of few-shot relation classification. Fig. 2 shows the attention weight matrix calculated between the query instance and support instance #2 of class A in Table 1. From this figure, we can see that attention-based local matching is able to capture meaningful local correspondences, such as the head entities Eva Funck and Cindy Robbins, the tail entities Gustav and Kimberly Beck, the key phrases son and daughter, the shared keyword “married”, and so on.

5.4.4 Class Matching

In this experiment, we compared two class matching functions: (1) the Euclidean distance (ED) (Snell et al., 2017) and (2) the learnable MLP function shown in Eq. (13). In order to exclude the influence of the instance-level attentive aggregation, these two matching functions were compared based on model 6 and model 9. After converting the MLP function in model 6 and model 9 to Euclidean distance, model 8 and model 10 were obtained. Comparing the performance of these models in Table 4, we have two findings. (1) When local matching was adopted, the learnable MLP for class matching (model 6) outperformed the ED metric (model 8) by a large margin. (2) After removing local matching, the learnable MLP for class matching (model 9) did not perform as well as the ED metric (model 10). One possible reason is that the local matching process enhances the interaction between a query instance and a support set when calculating their final representations, so a simple Euclidean distance may not be able to describe the complex correlation and dependency between them. An MLP mapping is more powerful than calculating the Euclidean distance, and can be more appropriate for class matching when local matching is also adopted.

Figure 2: The attention weight matrix calculated between the query instance and support instance #2 of class A in Table 1. Darker units have larger values. Each column of the matrix sums to one.

6 Conclusions

In this paper, a neural network with multi-level matching and aggregation has been proposed for few-shot relation classification. First, the query and support instances are encoded interactively via local matching and aggregation. Then, the support instances in a class are further aggregated to form the class prototype, with the aggregation weights calculated by attention-based instance matching. Finally, a learnable MLP matching function is employed to calculate the class matching score between the query instance and each candidate class. Furthermore, an additional objective function is designed to improve the consistency among the vector representations of the support instances in a class. Experiments have demonstrated the effectiveness of our proposed model, which achieves state-of-the-art performance on the FewRel dataset. Studying few-shot relation classification with data generated by distant supervision and extending our MLMAN model to zero-shot learning will be tasks for our future work.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. This work was partially funded by the National Natural Science Foundation of China (Grant Nos. U1636201 and 61871358).

References