KEML: A Knowledge-Enriched Meta-Learning Framework for Lexical Relation Classification

02/25/2020 ∙ by Chengyu Wang, et al. ∙ East China Normal University

Lexical relations describe how concepts are semantically related, in the form of relation triples. The accurate prediction of lexical relations between concepts is challenging, due to the sparsity of patterns indicating the existence of such relations. We propose the Knowledge-Enriched Meta-Learning (KEML) framework to address the task of lexical relation classification. In KEML, the LKB-BERT (Lexical Knowledge Base-BERT) model is presented to learn concept representations from massive text corpora, with rich lexical knowledge injected by distant supervision. A probabilistic distribution of auxiliary tasks is defined to increase the model's ability to recognize different types of lexical relations. We further combine a meta-learning process over the auxiliary task distribution and supervised learning to train the neural lexical relation classifier. Experiments over multiple datasets show that KEML outperforms state-of-the-art methods.




1 Introduction

As an important type of linguistic resource, lexical relations describe semantic associations between concepts. Such resources are organized as backbones in lexicons Miller (1995), semantic networks Speer et al. (2017), etc. The explicit usage of such resources has benefited a variety of NLP tasks, including relation extraction Shen et al. (2018), question answering Yang et al. (2017) and machine translation Thompson et al. (2019).

To accumulate such knowledge, Lexical Relation Classification (LRC) is a basic NLP task that classifies concept pairs into a finite set of lexical relations. In the literature, pattern-based and distributional methods are the two major types of LRC models Shwartz and Dagan (2016); Wang et al. (2017a). However, compared to the classification of factual relations for knowledge graph population Liu et al. (2017); Wu and He (2019), the accurate classification of lexical relations poses more challenges. i) Most lexical relations represent the common sense of human knowledge and are not frequently expressed in texts explicitly. (Footnote 1: For example, “(car, meronymy, steering wheel)” can be paraphrased as “steering wheels are part of cars”. However, this expression is usually omitted in texts, since it is basically common sense to humans.) Apart from Hearst patterns Hearst (1992) for hypernymy (“is-a”) extraction, textual patterns that indicate the existence of other types of lexical relations remain few, leading to the “pattern sparsity” problem Shwartz and Dagan (2016); Washio and Kato (2018a). ii) Distributional models assume concepts with similar contexts have similar embeddings Mikolov et al. (2013); Bojanowski et al. (2017). Representations of a concept pair learned by traditional word embedding models are not sufficient to distinguish different types of lexical relations Glavas and Vulic (2018); Ponti et al. (2019). iii) Many LRC datasets are highly imbalanced w.r.t. training instances of different lexical relations, and may contain randomly paired concepts. It is difficult for models to distinguish whether a concept pair has a particular type of lexical relation, or has very weak or no semantic relatedness.

In this work, the Knowledge-Enriched Meta-Learning (KEML) framework is presented to address these challenges for LRC, consisting of three modules: Knowledge Encoder, Auxiliary Task Generator and Relation Learner. In Knowledge Encoder, we propose the LKB-BERT (Lexical Knowledge Base-BERT) model to learn relation-sensitive concept representations. LKB-BERT is built upon BERT Devlin et al. (2019) and trained via new distant supervised learning tasks over lexical knowledge bases, which encodes both language patterns and relational lexical knowledge into the model. In Auxiliary Task Generator, we treat recognizing a single type of lexical relation as an auxiliary task. Based on meta-learning Finn et al. (2017), a probabilistic task distribution is defined for the model to optimize, which addresses the imbalanced property and the existence of random relations in LRC datasets. In Relation Learner, we combine a gradient-based meta-learning process over the auxiliary task distribution and supervised learning to train the final neural relation classifier. In particular, a relation recognition cell is designed and integrated into the neural network for this purpose.

This paper makes the following contributions:

  • We propose LKB-BERT to learn concept representations for LRC, considering unstructured texts and relational lexical knowledge.

  • A meta-learning process with auxiliary tasks for single relation recognition is proposed to improve the performance of LRC.

  • We evaluate KEML over multiple LRC benchmark datasets. Experimental results show that KEML outperforms state-of-the-art methods. (Footnote 2: All the datasets are publicly available. The source code will be released upon paper acceptance.)

2 Related Work

In this section, we overview related work on LRC, pre-trained language models and meta-learning.

As summarized in Shwartz and Dagan (2016), Lexical Relation Classification (LRC) models are categorized into two major types: pattern-based and distributional. Pattern-based approaches extract patterns w.r.t. a concept pair from texts as features to predict its lexical relation. For hypernymy relations, Hearst patterns Hearst (1992) are the most influential, often used for the construction of large-scale taxonomies Wu et al. (2012). To learn pattern representations, Shwartz et al. (2016) exploit LSTM-based RNNs to encode dependency paths of patterns. Roller et al. (2018); Le et al. (2019) calculate Hearst pattern-based statistics from texts and design hypernymy measures to predict the degrees of hypernymy between concepts. For other types of relations, LexNET Shwartz and Dagan (2016) extends the network architecture of Shwartz et al. (2016) for multi-way classification of lexical relations. Nguyen et al. (2016, 2017) design path-based neural networks to distinguish antonymy and synonymy. However, these methods may suffer from the lack of patterns and the lack of concept-pair occurrences in texts Washio and Kato (2018a).

With the rapid development of deep neural language models, distributional models attract more interest. While traditional methods directly leverage the two concepts’ embeddings as features for classifier training Weeds et al. (2014); Vylomova et al. (2016); Roller and Erk (2016), they may suffer from the “lexical memorization” problem Levy et al. (2015). Recently, more complicated neural networks have been proposed to encode the semantics of lexical relations. Attia et al. (2016) formulate LRC as a multi-task learning task and propose a convolutional neural network for LRC. Mrksic et al. (2017) propose the Attract-Repel model to learn the semantic specialization of word embeddings. Glavas and Vulic (2018) introduce the Specialization Tensor Model, which learns multiple relation-sensitive specializations of concept embeddings. SphereRE Wang et al. (2019) encodes concept pairs in the hyperspherical embedding space and achieves state-of-the-art results. There exist some models to learn word-pair representations for other NLP tasks Washio and Kato (2018b); Joshi et al. (2019); Camacho-Collados et al. (2019). KEML is also distributional, further improving LRC by training meta-learners over neural language models.

Pre-trained language models have gained attention from the NLP community. ELMo Peters et al. (2018) learns context-sensitive embeddings for each token from both the left-to-right and right-to-left directions. BERT Devlin et al. (2019) is a notable work, employing layers of transformer encoders to learn language representations. Follow-up works include Transformer-XL Dai et al. (2019), XLNet Yang et al. (2019), ALBERT Lan et al. (2019) and many more. Yet another direction is to fuse additional knowledge sources into BERT-like models. ERNIE Zhang et al. (2019) incorporates the rich semantics of entities in the model. KG-BERT Yao et al. (2019) and K-BERT Liu et al. (2019) employ relation prediction objectives in knowledge graphs as additional learning tasks. In our work, we leverage the conceptual facts in lexical knowledge bases to improve the representation learning for LRC.

Meta-learning is a learning paradigm to train models that can adapt to a variety of different tasks with little training data Vanschoren (2018), mostly applied to few-shot learning. In NLP, meta-learning algorithms have not been extensively employed, mostly due to the large numbers of training examples required to train the model for different NLP tasks. Existing models mostly focus on training meta-learners for single applications, such as link prediction Chen et al. (2019), dialog systems Madotto et al. (2019) and semantic parsing Guo et al. (2019). Dou et al. (2019) leverage meta-learning for various low-resource natural language understanding tasks. KEML is one of the early attempts to improve LRC via gradient-based meta-learning Finn et al. (2017). Since the mechanism of meta-learning is not our major research focus, we do not elaborate further.

Figure 1: The high-level framework of KEML (best viewed in color).

3 The KEML Framework

In this section, we formally describe the KEML framework for LRC. A brief technical flow is presented, followed by its algorithmic details.

3.1 A Brief Overview of KEML

We first overview the LRC task briefly. Denote $(x_i, y_i)$ as an arbitrary concept pair. The goal of LRC is to learn a classifier $f$ to predict the lexical relation $r_i \in \mathcal{R}$ between $x_i$ and $y_i$, based on a labeled training set $\mathcal{D} = \{(x_i, y_i, r_i)\}$. Here, $\mathcal{R}$ is the collection of all pre-defined lexical relation types (e.g., hypernymy, synonymy), (possibly) including a special relation type RAN (“random”), depending on the task setting. RAN means that $x_i$ and $y_i$ are randomly paired, without a clear association with any lexical relation.

The framework of KEML is illustrated in Figure 1, with three modules introduced below:

Knowledge Encoder. Representation learning for LRC is significantly different from learning traditional word embeddings. This is because some lexical concepts are naturally Multiword Expressions (e.g., card game, orange juice), in which the entire sequences of tokens should be encoded in the embedding space Cordeiro et al. (2016). Additionally, these models are insufficient to capture the lexical relations between concepts, due to the pattern sparsity issue Washio and Kato (2018a). Hence, the semantics of concepts should be encoded from a larger corpus and rich language resources.

Inspired by BERT Devlin et al. (2019) and its extensions, we propose the LKB-BERT (Lexical Knowledge Base-BERT) model to encode the semantics of concepts from massive text corpora and lexical knowledge bases. LKB-BERT employs the neural architecture and pre-trained parameters of BERT Devlin et al. (2019) for token encoding, and imposes two new distant supervised learning objectives over lexical knowledge bases (such as WordNet Miller (1995)) as fine-tuning tasks. After model training, each concept receives the embeddings of the last transformer encoding layer of LKB-BERT as its representation. Denote the embedding of a concept $x_i$ as $\vec{x}_i$, with the dimension as $d$.

Auxiliary Task Generator. As discussed, directly training classifiers based on concept embeddings may produce sub-optimal results, due to the highly imbalanced nature of LRC training sets and the existence of random (RAN) relations. Inspired by the design philosophy of meta-learning Finn et al. (2017, 2018) and its NLP applications Chen et al. (2019); Madotto et al. (2019), we regard the relation classifier as the meta-learner, and design a series of auxiliary tasks to update model parameters. Each task aims at distinguishing between concept pairs that have a particular relation and randomly paired concepts. The training sets for auxiliary tasks are sampled from subsets of the training set $\mathcal{D}$. Denote the collection of all auxiliary tasks as $\mathcal{T}$. The meta-learner is optimized over a probabilistic distribution of tasks $p(\mathcal{T})$. By designing $p(\mathcal{T})$ properly, the relation classifier is capable of alleviating the imbalanced classification and the RAN relation problems of LRC at the same time.

Relation Learner. Finally, we design a two-stage algorithm to train the neural relation classifier $f$: i) meta-learning and ii) supervised learning. In the meta-learning stage, the adapted model parameters of the neural network are iteratively learned over the task distribution $p(\mathcal{T})$. Therefore, the neural network learns how to recognize specific lexical relations, with the guidance of the underlying lexical knowledge base. Here, a special cell, i.e., the Single Relation Recognition Cell, is designed and integrated into the neural network. In the supervised learning stage, we fine-tune the meta-learned parameters to obtain the multi-way classification model for LRC over the training set $\mathcal{D}$.

3.2 Knowledge Encoder

We consider the training of LKB-BERT as a variant of the fine-tuning process of BERT Devlin et al. (2019). In the original BERT, the inputs are arbitrary spans of token sequences. To encode the semantics of concept pairs, we combine a concept pair $(x_i, y_i)$ to form a sequence of tokens, separated by a special token “[SEP]”, as the input for LKB-BERT (see Figure 1). We first initialize all the model parameters of the transformer encoders to be the same as BERT’s pre-training results. Different from the standard fine-tuning tasks in BERT Devlin et al. (2019) or KG-BERT Yao et al. (2019), to address the random-relation (RAN) problem, LKB-BERT learns to classify a concept pair $(x_i, y_i)$ into the lexical relation collection $\mathcal{R}$ (including the special relation type RAN).

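Let $\mathcal{D}_{KB}$ be the collection of labeled concept pairs in lexical knowledge bases. For each concept pair $(x_i, y_i) \in \mathcal{D}_{KB}$ with its label $r_i$, we compute $\tau_r(x_i, y_i)$ as the predicted score w.r.t. the lexical relation $r$ by LKB-BERT’s transformer encoders (we have $\tau_r(x_i, y_i) \in [0, 1]$ and $\sum_{r \in \mathcal{R}} \tau_r(x_i, y_i) = 1$). The first loss, i.e., the multi-way relation classification loss $\mathcal{L}^{(m)}_{KB}$, is defined as (Footnote 3: For simplicity, we omit all the regularization terms of objective functions throughout this paper.):

$$\mathcal{L}^{(m)}_{KB} = -\sum_{(x_i, y_i) \in \mathcal{D}_{KB}} \sum_{r \in \mathcal{R}} \mathbb{1}(r_i = r) \cdot \log \tau_r(x_i, y_i)$$

where $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the input expression is true, and 0 otherwise.

To improve LKB-BERT’s ability to recognize concept pairs without any lexical relations, we add a binary cross-entropy loss for LKB-BERT to optimize. We only need LKB-BERT to learn whether a concept pair is randomly paired. Let $\neg\mathrm{RAN}$ denote any non-random lexical relation type in $\mathcal{R}$. The complete objective of LKB-BERT is $\mathcal{L}_{KB} = \mathcal{L}^{(m)}_{KB} + \mathcal{L}^{(b)}_{KB}$, with $\mathcal{L}^{(b)}_{KB}$ to be:

$$\mathcal{L}^{(b)}_{KB} = -\sum_{(x_i, y_i) \in \mathcal{D}_{KB}} \Big( \mathbb{1}(r_i = \mathrm{RAN}) \cdot \log \tau_{\mathrm{RAN}}(x_i, y_i) + \mathbb{1}(r_i = \neg\mathrm{RAN}) \cdot \log \tau_{\neg\mathrm{RAN}}(x_i, y_i) \Big)$$

where $\tau_{\mathrm{RAN}}$ and $\tau_{\neg\mathrm{RAN}}$ here are the scores of the binary classification output.

In KEML, we regard lexical relations sampled from WordNet Miller (1995) and in the training sets as sources of $\mathcal{D}_{KB}$. As for the neural network structure, LKB-BERT has two sets of classification outputs. Refer to the two C units of LKB-BERT in Figure 1.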
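The two fine-tuning objectives described above can be illustrated numerically. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes the encoder already outputs a probability for every relation type plus a separate probability that the pair is random. The function names, the label string `"RAN"`, and the unweighted sum of the two losses are assumptions made for illustration.

```python
import math

RAN = "RAN"  # hypothetical label for the special "random" relation type


def multiway_loss(batch):
    """Multi-way cross-entropy: minus the log-probability of each gold relation."""
    return -sum(math.log(probs[label]) for probs, _, label in batch)


def binary_loss(batch):
    """Binary cross-entropy on whether a concept pair is randomly paired."""
    total = 0.0
    for _, p_random, label in batch:
        p = p_random if label == RAN else 1.0 - p_random
        total -= math.log(p)
    return total


def lkb_bert_loss(batch):
    """Assumed complete objective: unweighted sum of the two losses."""
    return multiway_loss(batch) + binary_loss(batch)


# Each item: (multi-way probabilities, P(random) from the binary head, gold label).
batch = [
    ({"hypernymy": 0.7, "synonymy": 0.2, RAN: 0.1}, 0.1, "hypernymy"),
    ({"hypernymy": 0.1, "synonymy": 0.1, RAN: 0.8}, 0.9, RAN),
]
print(round(lkb_bert_loss(batch), 4))
```

Keeping the binary head separate from the multi-way head mirrors the statement that LKB-BERT has two sets of classification outputs; regularization terms are omitted here, as in the text.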

3.3 Auxiliary Task Generator

Although LKB-BERT is capable of learning deep concept representations, using such features directly for classifier training is insufficient. The reasons are twofold. i) Direct classification can suffer from “lexical memorization” Levy et al. (2015), meaning that the relation classifier only learns the individual characteristics of the two concepts alone. ii) The LRC datasets are highly imbalanced. For example, in the widely used dataset EVALution Santus et al. (2015), several lexical relation types have very few training instances. Hence, the learning bias of a classifier trained by naive approaches is almost unavoidable.

Finn et al. (2017) observe that meta-learning achieves better parameter initialization for few-shot learning, compared to multi-task learning across all the tasks. In KEML, we propose a series of auxiliary tasks $\mathcal{T}$, where each task $T_r \in \mathcal{T}$ (named Single Relation Recognition) corresponds to a specific type of lexical relation $r \in \mathcal{R}$ (excluding the random relation type RAN). The goal is to distinguish concept pairs with the lexical relation type $r$ from randomly paired concepts. Let $S_r$ and $S_{\mathrm{RAN}}$ be the collections of concept pairs with lexical relations $r$ and RAN, respectively, randomly sampled from the training set $\mathcal{D}$. The goal of learning auxiliary task $T_r$ is to minimize the following loss function $\mathcal{L}(T_r)$:

$$\mathcal{L}(T_r) = -\sum_{(x_i, y_i) \in S_r} \log \delta_r(x_i, y_i) - \sum_{(x_i, y_i) \in S_{\mathrm{RAN}}} \log \big(1 - \delta_r(x_i, y_i)\big)$$

where $\delta_r(x_i, y_i)$ is the predicted probability of the concept pair $(x_i, y_i)$ having the lexical relation $r$.

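A remaining problem is the design of the probabilistic distribution of auxiliary tasks $p(\mathcal{T})$. We need to consider two issues. i) The semantics of all types of lexical relations should be fully learned. ii) Assume the batch sizes for all tasks are the same, i.e., $|S_r \cup S_{\mathrm{RAN}}|$ is constant across tasks. Tasks related to lexical relations with more training instances should be learned more frequently by the meta-learner. Let $\mathcal{D}_r$ be the subset of the training set $\mathcal{D}$ with the lexical relation $r$. For any two non-random relation types $r_1$ and $r_2$, if $|\mathcal{D}_{r_1}| > |\mathcal{D}_{r_2}|$, we have the sampling probability $p(T_{r_1}) > p(T_{r_2})$. Hence, we define $p(T_r)$ empirically as follows:

where $\gamma$ is the smoothing factor. The expectation of all the losses of auxiliary tasks (represented as $\mathcal{L}(\mathcal{T})$) is $\mathbb{E}_{T_r \sim p(\mathcal{T})}[\mathcal{L}(T_r)] = \sum_{T_r \in \mathcal{T}} p(T_r) \cdot \mathcal{L}(T_r)$, which is the real learning objective that these auxiliary tasks aim to optimize.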
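To make the two design constraints concrete, here is a small sketch of one task distribution that satisfies them. The exact formula is not recoverable from the text above, so the log-count-plus-smoothing form used here is an assumption, as are the function names and the example counts. The logarithm flattens the distribution so that rare relations are still sampled, while larger relation types keep a higher probability.

```python
import math
import random


def task_distribution(relation_counts, gamma=1.0):
    """Assumed form: p(T_r) proportional to ln|D_r| + gamma (random type excluded)."""
    weights = {r: math.log(n) + gamma for r, n in relation_counts.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def sample_tasks(p_tasks, n_tasks, rng):
    """Draw n_tasks auxiliary tasks (with replacement) from the task distribution."""
    relations = list(p_tasks)
    return rng.choices(relations, weights=[p_tasks[r] for r in relations], k=n_tasks)


# Hypothetical training-set sizes per non-random relation type.
counts = {"hypernymy": 4000, "meronymy": 900, "antonymy": 60}
p = task_distribution(counts, gamma=1.0)
tasks = sample_tasks(p, n_tasks=3, rng=random.Random(0))
print(p)
print(tasks)
```

Under this assumed form, a relation with about 67 times more instances is sampled less than twice as often, which reflects the intent that all relation types are learned while frequent ones are visited more.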

3.4 Relation Learner

In this part, we introduce the meta-learning algorithm for LRC. Assume the relation classifier $f$ is parameterized by $\theta$, with learning and meta-learning rates $\alpha$ and $\epsilon$. Relation Learner has two stages: i) meta-learning and ii) supervised learning. For each iteration in meta-learning, we sample $N$ auxiliary tasks from $p(\mathcal{T})$. For each auxiliary task $T_r$, we learn adapted parameters based on two sampled subsets, $S_r$ and $S_{\mathrm{RAN}}$, so that the model learns to recognize one specific type of lexical relation. After that, the adapted parameters on each task are averaged and used to update $\theta$. We simplify the meta-update step by only taking first-order derivatives Nichol et al. (2018) to avoid the time-consuming second-order derivative computation. For supervised learning, we fine-tune the parameters of the classifier to obtain the multi-way LRC model over the entire training set $\mathcal{D}$. The algorithmic description is shown in Algorithm 1.

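Algorithm 1 Meta-Learning Algorithm for LRC
Initialize model parameters $\theta$;
while not converged do
    Sample $N$ auxiliary tasks from the task distribution $p(\mathcal{T})$;
    for each auxiliary task $T_r$ do
        Sample a batch (positive samples $S_r$ and negative samples $S_{\mathrm{RAN}}$) from the training set $\mathcal{D}$;
        Update adapted parameters $\theta_r$ from $\theta$ with learning rate $\alpha$, based on $S_r$ and $S_{\mathrm{RAN}}$;
    end for
    Update meta-parameters $\theta$ with meta-learning rate $\epsilon$, using the adapted parameters averaged over the sampled tasks;
end while
Fine-tune $f$ over $\mathcal{D}$ by standard supervised learning for LRC;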
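The control flow of the meta-learning stage can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the authors' implementation: each auxiliary task is replaced by a hypothetical quadratic loss with a task-specific optimum, the inner update is a single gradient step, and the meta-update moves the parameters toward the average of the adapted parameters, in the spirit of the first-order method of Nichol et al. (2018). All names and values are illustrative.

```python
import random


def grad_quadratic(theta, center):
    """Gradient of the toy task loss 0.5 * ||theta - center||^2."""
    return [t - c for t, c in zip(theta, center)]


def meta_learn(theta, task_centers, p_tasks, alpha, epsilon, n_tasks, iterations, rng):
    """First-order meta-learning loop over sampled auxiliary tasks."""
    names = list(task_centers)
    weights = [p_tasks[name] for name in names]
    for _ in range(iterations):
        sampled = rng.choices(names, weights=weights, k=n_tasks)
        adapted = []
        for name in sampled:
            grad = grad_quadratic(theta, task_centers[name])
            # Inner step: adapt a copy of the parameters to this task.
            adapted.append([t - alpha * g for t, g in zip(theta, grad)])
        # Meta step: move theta toward the average of the adapted parameters.
        mean_adapted = [sum(column) / len(adapted) for column in zip(*adapted)]
        theta = [t + epsilon * (m - t) for t, m in zip(theta, mean_adapted)]
    return theta


centers = {"hypernymy": [1.0, 0.0], "meronymy": [0.0, 1.0]}
p_tasks = {"hypernymy": 0.6, "meronymy": 0.4}
theta = meta_learn([5.0, 5.0], centers, p_tasks, alpha=0.5, epsilon=0.5,
                   n_tasks=2, iterations=200, rng=random.Random(0))
print(theta)
```

In this toy setting the parameters settle between the two task optima, which is the kind of shared initialization the supervised fine-tuning stage would start from.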
Figure 2: Structure of SRR Cell (we only show one cell, with some other parts of the network omitted).

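Finally, we describe the neural network structure for LRC. In this network, the Single Relation Recognition Cell (SRR Cell) is designed for learning auxiliary tasks and enabling knowledge injection, with the structure illustrated in Figure 2. For each lexical relation $r$, we extract the relation prototype $\vec{r}$ from the lexical knowledge base $\mathcal{D}_{KB}$ by averaging all the embedding offsets of concept pairs with relation $r$:

$$\vec{r} = \frac{1}{|\mathcal{D}^{(r)}_{KB}|} \sum_{(x_i, y_i) \in \mathcal{D}^{(r)}_{KB}} (\vec{x}_i - \vec{y}_i)$$

where $\mathcal{D}^{(r)}_{KB}$ denotes the concept pairs in $\mathcal{D}_{KB}$ with relation $r$.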

We use the offset $\vec{x}_i - \vec{y}_i$ as features because the Diff model is effective for representing semantic relations Fu et al. (2014); Vylomova et al. (2016); Wang et al. (2019). Consider the SRR Cell structure in Figure 2. Given the inputs $\vec{x}_i$, $\vec{y}_i$ and the relation prototype $\vec{r}$, we compute the $d$-dimensional hidden states $\vec{U}_1$ and $\vec{U}_2$ by:

where $W_1$, $W_2$, $\vec{b}_1$ and $\vec{b}_2$ are the weights and biases of these states. This can be interpreted as inferring the embeddings of relation objects or subjects, given the relation prototype and subjects/objects as inputs, similar to knowledge graph completion Wang et al. (2017b). Next, we compute the offsets between $\vec{U}_1$ and $\vec{y}_i$ and between $\vec{U}_2$ and $\vec{x}_i$, and two new $d$-dimensional hidden states $\vec{U}_3$ and $\vec{U}_4$, with $W_3$, $W_4$, $\vec{b}_3$ and $\vec{b}_4$ as learnable parameters:

We can see that if $x_i$ and $y_i$ actually have the lexical relation $r$, $\vec{U}_3$ and $\vec{U}_4$ should be good indicators of the existence of such a relation. For example, one way to interpret $\vec{U}_1$ and $\vec{U}_3$ is that $\vec{U}_1$ tries to infer the relation object given $\vec{x}_i$ and $\vec{r}$ as inputs, and $\vec{U}_3$ makes the judgment by comparing $\vec{U}_1$ and the true relation object embedding $\vec{y}_i$. Hence, the network learns whether $\vec{r}$ is a good fit for the concept pair $(x_i, y_i)$. The functionalities of $\vec{U}_2$ and $\vec{U}_4$ are similar, only with directions reversed. After that, we concatenate $\vec{U}_3$ and $\vec{U}_4$ as part of the inputs for the next layer.
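The intuition behind the relation prototype and the comparison step can be shown without any learned layers. The sketch below is a simplification under stated assumptions: the offset direction (subject minus object) is assumed, the learned weights and non-linearities of the SRR Cell are omitted, and the function names and toy embeddings are hypothetical.

```python
def relation_prototype(pairs):
    """Average embedding offset (x - y) over concept pairs that share one relation."""
    dim = len(pairs[0][0])
    sums = [0.0] * dim
    for x, y in pairs:
        for k in range(dim):
            sums[k] += x[k] - y[k]
    return [s / len(pairs) for s in sums]


def object_mismatch(x, y, prototype):
    """Compare the true object y with the object inferred from x and the prototype."""
    inferred_y = [xk - rk for xk, rk in zip(x, prototype)]
    return [yk - ik for yk, ik in zip(y, inferred_y)]


# Toy 2-dimensional embeddings of pairs that hold the same relation.
pairs = [([1.0, 2.0], [0.0, 1.0]), ([3.0, 1.0], [2.0, 0.0])]
prototype = relation_prototype(pairs)

fit = object_mismatch([5.0, 5.0], [4.0, 4.0], prototype)     # pair follows the offset
misfit = object_mismatch([5.0, 5.0], [0.0, 9.0], prototype)  # pair does not
print(prototype, fit, misfit)
```

A mismatch vector close to zero suggests that the prototype fits the pair. In the full cell this comparison is fed through learned layers instead of being read off directly.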

Reconsider the entire network structure in Figure 1. For each concept pair $(x_i, y_i)$, we compare $\vec{x}_i$ and $\vec{y}_i$ with all relation prototypes (treated as constants in the network). The results are represented by vectors of hidden states. After that, a dense layer and multiple output layers are connected. During the meta-learning stage, we train binary meta-classifiers to minimize the auxiliary-task losses $\mathcal{L}(T_r)$ ($T_r \in \mathcal{T}$), with meta-parameters updated. In the supervised learning stage, we discard all the output layers of the meta-classifiers and train the final LRC model. This is because the numbers of output units of the meta-classifiers and the final classifier are different, so the parameters of the last layer cannot be re-used. Note also that additional skip connections between $\vec{x}_i$, $\vec{y}_i$ and the dense layer are added, in order to improve the effect of back propagation during training He et al. (2016).

Discussion. KEML employs a “divide-and-conquer” strategy for LRC. In meta-learning, each SRR Cell learns the semantics of a single lexical relation type, and also handles the random-relation (RAN) problem. Hence, the supervised learning process of the relation classifier can be improved with better parameter initializations. Unlike traditional meta-learning Finn et al. (2017, 2018), KEML does not contain meta-testing steps (since LRC is not a few-shot learning problem), but takes advantage of both meta-learning and supervised learning.

Method       | K&H+N               | BLESS               | ROOT09              | EVALution
             | Pre    Rec    F1    | Pre    Rec    F1    | Pre    Rec    F1    | Pre    Rec    F1
Concat       | 0.909  0.906  0.904 | 0.811  0.812  0.811 | 0.636  0.675  0.646 | 0.531  0.544  0.525
Diff         | 0.888  0.886  0.885 | 0.801  0.803  0.802 | 0.627  0.655  0.638 | 0.521  0.531  0.528
NPB          | 0.713  0.604  0.550 | 0.759  0.756  0.755 | 0.788  0.789  0.788 | 0.530  0.537  0.503
NPB+Aug      | -      -      0.897 | -      -      0.842 | -      -      0.778 | -      -      0.489
LexNET       | 0.985  0.986  0.985 | 0.894  0.893  0.893 | 0.813  0.814  0.813 | 0.601  0.607  0.600
LexNET+Aug   | -      -      0.970 | -      -      0.927 | -      -      0.806 | -      -      0.545
SphereRE     | 0.990  0.989  0.990 | 0.938  0.938  0.938 | 0.860  0.862  0.861 | 0.620  0.621  0.620
LKB-BERT     | 0.981  0.982  0.981 | 0.939  0.936  0.937 | 0.863  0.864  0.863 | 0.638  0.645  0.639
KEML-S       | 0.984  0.983  0.984 | 0.942  0.940  0.941 | 0.877  0.871  0.873 | 0.649  0.651  0.644
KEML         | 0.993  0.993  0.993 | 0.944  0.943  0.944 | 0.878  0.877  0.878 | 0.663  0.660  0.660
Table 1: LRC results over four benchmark datasets in terms of Precision, Recall and F1.

4 Experiments

We conduct extensive experiments to evaluate KEML over multiple benchmark datasets, and compare it with state-of-the-art methods.

4.1 Datasets and Experimental Settings

We employ Google’s pre-trained BERT model (Footnote 4: We use the uncased, base version of BERT.) to initialize the parameters of LKB-BERT. The lexical knowledge base $\mathcal{D}_{KB}$ contains 16.7K relation triples (Footnote 5: To avoid data leakage, we have removed relation triples in $\mathcal{D}_{KB}$ that overlap with all validation and testing sets.). Following Wang et al. (2019) (which produced state-of-the-art results for LRC previously), we use five public benchmark datasets for multi-way classification of lexical relations to evaluate KEML, namely, K&H+N Necsulescu et al. (2015), BLESS Baroni and Lenci (2011), ROOT09 Santus et al. (2016b), EVALution Santus et al. (2015) and CogALex-V Subtask 2 Santus et al. (2016a). Due to space limitations, we do not introduce all the datasets here. Refer to Wang et al. (2019) for the statistical summary of all the datasets. K&H+N, BLESS, ROOT09 and EVALution are partitioned into training, validation and testing sets, following the exact same settings as in Shwartz and Dagan (2016). The CogALex-V dataset has training and testing sets only, with no validation set provided Santus et al. (2016a). Hence, we randomly sample 80% of the training set to learn the parameters, and use the rest for parameter tuning.

The default hyper-parameter settings of KEML are as follows: , and . (Footnote 6: We empirically set to ensure that in each iteration of the meta-learning process, each auxiliary task is learned once on average.) We use as the activation function, and Adam as the optimizer to train the neural network. All the model parameters are -regularized, with the hyper-parameter . The batch size is set as 256. The dimension of hidden layers is set to be the same as $d$ (768 for the base BERT model). The number of parameters of the final neural classifier is around 7M to 24M, depending on the number of classes. The algorithms are implemented with TensorFlow and trained with an NVIDIA Tesla P100 GPU. For evaluation, we use Precision, Recall and F1 as metrics, reported as the average of all the classes, weighted by the support.
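The support-weighted averaging of the evaluation metrics can be written out explicitly. The following sketch, in plain Python with a hypothetical function name and toy labels, computes per-class Precision, Recall and F1 and averages them with class support as weights. It illustrates the metric definition and is not the authors' evaluation script.

```python
from collections import Counter


def weighted_prf(gold, pred):
    """Per-class precision, recall and F1, averaged with class support as weights."""
    support = Counter(gold)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        predicted = sum(1 for p in pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = n / len(gold)
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals


gold = ["hyper", "hyper", "syn", "random"]
pred = ["hyper", "syn", "syn", "random"]
scores = weighted_prf(gold, pred)
print(scores)
```

Note that the weighted F1 is the support-weighted mean of the per-class F1 scores, so it need not equal the harmonic mean of the weighted Precision and Recall.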

4.2 General Experimental Results

We follow the experimental steps of Wang et al. (2019) to evaluate KEML over K&H+N, BLESS, ROOT09 and EVALution. Since EVALution does not contain random (RAN) relations, during the meta-learning process of auxiliary task $T_r$, we randomly sample relation triples from the training set $\mathcal{D}$ that do not have the relation $r$, and take them as the negative samples $S_{\mathrm{RAN}}$. We manually tune the regularization hyper-parameter from to using the validation set (based on F1) and report the performance over the testing set. As for baselines, we consider the traditional distributional models Concat Baroni et al. (2012) and Diff Weeds et al. (2014), the pattern-based models NPB Shwartz et al. (2016), LexNET Shwartz and Dagan (2016), NPB+Aug and LexNET+Aug Washio and Kato (2018a), and the state-of-the-art model SphereRE Wang et al. (2019). We refer readers to Shwartz and Dagan (2016); Washio and Kato (2018a); Wang et al. (2019) for detailed descriptions of these baselines. Additionally, we implement two variants of our approach: i) LKB-BERT (using trained concept representations to predict lexical relations) and ii) KEML-S (KEML without the meta-learning stage). The results of KEML and all the baselines are summarized in Table 1.

As shown, KEML outperforms all baselines, especially over BLESS, ROOT09 and EVALution. As for K&H+N, KEML produces a slightly better F1 score (0.3%) than the strongest baseline SphereRE Wang et al. (2019). A probable cause is that K&H+N is an “easy” dataset (99% F1 by SphereRE), leaving little room for improvement. Comparing KEML against LKB-BERT and KEML-S, we can conclude that, the knowledge enrichment technique for concept representation learning and the meta-learning algorithm are highly beneficial for accurate prediction of lexical relations.

4.3 Results of CogALex-V Shared Task

We evaluate KEML over the CogALex-V Shared Task (Subtask 2) Santus et al. (2016a). This dataset is the most challenging, as it contains a large number of random word pairs and disables “lexical memorization”. The organizers require participants to exclude the results of the random class from the average, and to report the F1 scores for each type of lexical relation. We consider the two top systems reported in this task (i.e., GHHH Attia et al. (2016) and LexNET Shwartz and Dagan (2016)), as well as two recent models that have been evaluated over the shared task (i.e., STM Glavas and Vulic (2018) and SphereRE Wang et al. (2019)) as strong competitors. Because the training set contains an overwhelming number of random word pairs, during the training process of KEML-S and KEML, we randomly discard 70% (manually tuned) of the random pairs in each epoch (Footnote 7: This trick improves the performance of KEML-S and KEML by 1.8% and 2.2%, in terms of overall F1.). Results are reported in Table 2, showing that KEML achieves the highest F1 score of 50.0%. It also has the highest scores on three types of lexical relations: synonymy (SYN), hypernymy (HYP) and meronymy (MER).
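The per-epoch discarding of random pairs can be sketched as follows. This is an illustrative helper, not the authors' code: the function name, the `"RANDOM"` label string and the choice to drop an exact fraction (instead of dropping each pair independently with some probability) are assumptions.

```python
import random


def downsample_random_pairs(examples, discard_ratio, rng, random_label="RANDOM"):
    """Keep all non-random pairs and a fresh random subset of the random pairs."""
    random_pairs = [e for e in examples if e[2] == random_label]
    others = [e for e in examples if e[2] != random_label]
    n_keep = round(len(random_pairs) * (1.0 - discard_ratio))
    kept = others + rng.sample(random_pairs, n_keep)
    rng.shuffle(kept)
    return kept


# Hypothetical training triples: (concept x, concept y, relation label).
examples = [("w%d" % i, "v%d" % i, "RANDOM") for i in range(10)]
examples += [("car", "wheel", "MER"), ("cat", "animal", "HYP"),
             ("big", "large", "SYN"), ("hot", "cold", "ANT")]

rng = random.Random(0)
epoch_data = downsample_random_pairs(examples, discard_ratio=0.7, rng=rng)
print(len(epoch_data))
```

Calling the helper once per epoch with the same generator yields a different subset of random pairs each time, so the model still sees most random pairs over the course of training.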

Method     | SYN    ANT    HYP    MER   | All
GHHH       | 0.204  0.448  0.491  0.497 | 0.423
LexNET     | 0.297  0.425  0.526  0.493 | 0.445
STM        | 0.221  0.504  0.498  0.504 | 0.453
SphereRE   | 0.286  0.479  0.538  0.539 | 0.471
LKB-BERT   | 0.281  0.470  0.532  0.530 | 0.464
KEML-S     | 0.276  0.470  0.542  0.631 | 0.485
KEML       | 0.292  0.492  0.547  0.652 | 0.500
Table 2: LRC results for each lexical relation type over the CogALex-V shared task in terms of F1.
Figure 3: Accuracy of single relation prediction during the meta-learning process (best viewed in color). Panels: (a) K&H+N; (b) BLESS; (c) ROOT09.
Concept Pair             | Predicted   | True
(turtle, frog)           | Synonym     | Co-hyponym
(bowl, glass)            | Co-hyponym  | Meronym
(cannon, warrior)        | Synonym     | Random
(draw, pull)             | Random      | Synonym
(symbolism, connection)  | Random      | Hypernym
(affection, healthy)     | Co-hyponym  | Attribute
Table 3: Cases of prediction errors in the experiments. Due to different expressions of relation names in all datasets, we map the relation names in these datasets to relation names in WordNet.

4.4 Detailed Analysis of KEML

To facilitate deeper understanding, we conduct additional experiments to analyze KEML’s components. We first study how knowledge-enriched concept representation learning benefits LRC. We implement three models: LKB-BERT (Binary), LKB-BERT (Multi) and LKB-BERT (Full). LKB-BERT (Binary) and LKB-BERT (Multi) only fine-tune on a single objective: the binary loss $\mathcal{L}^{(b)}_{KB}$ and the multi-way loss $\mathcal{L}^{(m)}_{KB}$, respectively. LKB-BERT (Full) is the full implementation, as described previously. We take the two concepts’ representations as features ($\vec{x}_i$ and $\vec{y}_i$) to train relation classifiers for LRC and report the results over the validation sets. For fair comparison, we use neural networks with one hidden layer (with the dimension , and the activation function ) as classifiers in all the experiments and present the results in Table 4. We can see that the improvement from using the complete objective $\mathcal{L}_{KB}$ is consistent across all the datasets (ranging from 0.8% to 1.9% in terms of F1). In particular, LKB-BERT outperforms the strongest baseline SphereRE in a few cases (for example, on the dataset ROOT09, as shown in Table 1 and Table 4). Hence, LKB-BERT is capable of encoding knowledge in patterns and lexical knowledge bases to represent the semantics of lexical relations.

Dataset    | LKB-BERT (Binary) | LKB-BERT (Multi) | LKB-BERT (Full)
K&H+N      | 0.964             | 0.972            | 0.983
BLESS      | 0.921             | 0.929            | 0.939
ROOT09     | 0.854             | 0.861            | 0.863
EVALution  | 0.630             | 0.632            | 0.641
CogALex-V  | 0.464             | 0.467            | 0.472
Table 4: LRC results using concept embeddings generated by LKB-BERT and variants in terms of F1.

Next, we look into the meta-learning process in KEML. We test whether SRR Cells can distinguish a specific type of lexical relation from random concept pairs. In each iteration of meta-learning, we sample another batch of positive and negative samples from the training set $\mathcal{D}$ and compute the accuracy of single relation recognition. Figure 3 illustrates how the accuracy changes over time on K&H+N, BLESS and ROOT09. Within 100 iterations, our models reach the desired performance, yielding good parameter initializations for LRC. This experiment also explains why KEML produces better results than KEML-S.

4.5 Error Analysis

We analyze prediction errors produced by KEML. Because the inputs of our task are very simple and the interpretation of deep neural language models is still challenging, the error analysis process is rather difficult. Here, we analyze the causes of errors from a linguistic point of view, with some cases presented in Table 3. As seen, some types of lexical relations are very similar in semantics. For instance, concept pairs with the synonymy relation and the co-hyponymy relation are usually mapped to similar positions in the embedding space. Hence, it is difficult for models to distinguish between the two relations without rich contextual information available. Another problem is that some of the relations are “blurry” in semantics, making it hard for KEML to discriminate between these relations and random relations.

5 Conclusion and Future Work

In this paper, we present the Knowledge-Enriched Meta-Learning (KEML) framework to distinguish different lexical relations. Experimental results confirm that KEML outperforms state-of-the-art approaches. Future work includes: i) improving concept representation learning with deep neural language models; ii) integrating richer linguistic and commonsense knowledge into KEML; and iii) extending KEML to other similar semantics-intensive NLP tasks, such as natural language inference.


  • Attia et al. (2016) Mohammed Attia, Suraj Maharjan, Younes Samih, Laura Kallmeyer, and Thamar Solorio. 2016. Cogalex-v shared task: GHHH - detecting semantic relations via word embeddings. In CogALex@COLING, pages 86–91.
  • Baroni et al. (2012) Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In EACL, pages 23–32.
  • Baroni and Lenci (2011) Marco Baroni and Alessandro Lenci. 2011. How we blessed distributional semantic evaluation. In GEMS Workshop.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.
  • Camacho-Collados et al. (2019) José Camacho-Collados, Luis Espinosa Anke, and Steven Schockaert. 2019. Relational word embeddings. In ACL, pages 3286–3296.
  • Chen et al. (2019) Mingyang Chen, Wen Zhang, Wei Zhang, Qiang Chen, and Huajun Chen. 2019. Meta relational learning for few-shot link prediction in knowledge graphs. In EMNLP-IJCNLP, pages 4216–4225.
  • Cordeiro et al. (2016) Silvio Cordeiro, Carlos Ramisch, Marco Idiart, and Aline Villavicencio. 2016. Predicting the compositionality of nominal compounds: Giving word embeddings a hard time. In ACL, pages 1986–1997.
  • Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. In ACL, pages 2978–2988.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186.
  • Dou et al. (2019) Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. 2019. Investigating meta-learning algorithms for low-resource natural language understanding tasks. In EMNLP-IJCNLP, pages 1192–1197.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, volume 70, pages 1126–1135.
  • Finn et al. (2018) Chelsea Finn, Kelvin Xu, and Sergey Levine. 2018. Probabilistic model-agnostic meta-learning. In NeurIPS, pages 9537–9548.
  • Fu et al. (2014) Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In ACL, pages 1199–1209.
  • Glavas and Vulic (2018) Goran Glavas and Ivan Vulic. 2018. Discriminating between lexico-semantic relations with the specialization tensor model. In NAACL, pages 181–187.
  • Guo et al. (2019) Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. 2019. Coupling retrieval and meta-learning for context-dependent semantic parsing. In ACL, pages 855–866.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR, pages 770–778.
  • Hearst (1992) Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING, pages 539–545.
  • Joshi et al. (2019) Mandar Joshi, Eunsol Choi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. 2019. pair2vec: Compositional word-pair embeddings for cross-sentence inference. In NAACL, pages 3597–3608.
  • Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. CoRR, abs/1909.11942.
  • Le et al. (2019) Matt Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. 2019. Inferring concept hierarchies from text corpora via hyperbolic embeddings. In ACL, pages 3231–3241.
  • Levy et al. (2015) Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In NAACL, pages 970–976.
  • Liu et al. (2017) Liyuan Liu, Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, and Jiawei Han. 2017. Heterogeneous supervision for relation extraction: A representation learning approach. In EMNLP, pages 46–56.
  • Liu et al. (2019) Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2019. K-BERT: enabling language representation with knowledge graph. CoRR, abs/1909.07606.
  • Madotto et al. (2019) Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In ACL, pages 5454–5459.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR.
  • Miller (1995) George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.
  • Mrksic et al. (2017) Nikola Mrksic, Ivan Vulic, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gasic, Anna Korhonen, and Steve J. Young. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. TACL, 5:309–324.
  • Necsulescu et al. (2015) Silvia Necsulescu, Sara Mendes, David Jurgens, Núria Bel, and Roberto Navigli. 2015. Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships. In *SEM, pages 182–192.
  • Nguyen et al. (2016) Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In ACL.
  • Nguyen et al. (2017) Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2017. Distinguishing antonyms and synonyms in a pattern-based neural network. In EACL, pages 76–85.
  • Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta-learning algorithms. CoRR, abs/1803.02999.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL, pages 2227–2237.
  • Ponti et al. (2019) Edoardo Maria Ponti, Ivan Vulic, Goran Glavas, Roi Reichart, and Anna Korhonen. 2019. Cross-lingual semantic specialization via lexical relation induction. In EMNLP-IJCNLP, pages 2206–2217.
  • Roller and Erk (2016) Stephen Roller and Katrin Erk. 2016. Relations such as hypernymy: Identifying and exploiting hearst patterns in distributional vectors for lexical entailment. In EMNLP, pages 2163–2172.
  • Roller et al. (2018) Stephen Roller, Douwe Kiela, and Maximilian Nickel. 2018. Hearst patterns revisited: Automatic hypernym detection from large text corpora. In ACL, pages 358–363.
  • Santus et al. (2016a) Enrico Santus, Anna Gladkova, Stefan Evert, and Alessandro Lenci. 2016a. The cogalex-v shared task on the corpus-based identification of semantic relations. In CogALex@COLING, pages 69–79.
  • Santus et al. (2016b) Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin Lu, and Chu-Ren Huang. 2016b. Nine features in a random forest to learn taxonomical semantic relations. In LREC.
  • Santus et al. (2015) Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. Evalution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In LDL@ACL-IJCNLP, pages 64–69.
  • Shen et al. (2018) Jiaming Shen, Zeqiu Wu, Dongming Lei, Chao Zhang, Xiang Ren, Michelle T. Vanni, Brian M. Sadler, and Jiawei Han. 2018. Hiexpan: Task-guided taxonomy construction by hierarchical tree expansion. In KDD, pages 2180–2189.
  • Shwartz and Dagan (2016) Vered Shwartz and Ido Dagan. 2016. Path-based vs. distributional information in recognizing lexical semantic relations. In CogALex@COLING, pages 24–29.
  • Shwartz et al. (2016) Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In ACL.
  • Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, pages 4444–4451.
  • Thompson et al. (2019) Brian Thompson, Rebecca Knowles, Xuan Zhang, Huda Khayrallah, Kevin Duh, and Philipp Koehn. 2019. Hablex: Human annotated bilingual lexicons for experiments in machine translation. In EMNLP-IJCNLP, pages 1382–1387.
  • Vanschoren (2018) Joaquin Vanschoren. 2018. Meta-learning: A survey. CoRR, abs/1810.03548.
  • Vylomova et al. (2016) Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2016. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. In ACL, pages 1671–1682.
  • Wang et al. (2017a) Chengyu Wang, Xiaofeng He, and Aoying Zhou. 2017a. A short survey on taxonomy learning from text corpora: Issues, resources and recent advances. In EMNLP, pages 1190–1203.
  • Wang et al. (2019) Chengyu Wang, Xiaofeng He, and Aoying Zhou. 2019. Spherere: Distinguishing lexical relations with hyperspherical relation embeddings. In ACL, pages 1727–1737.
  • Wang et al. (2017b) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017b. Knowledge graph embedding: A survey of approaches and applications. IEEE Trans. Knowl. Data Eng., 29(12):2724–2743.
  • Washio and Kato (2018a) Koki Washio and Tsuneaki Kato. 2018a. Filling missing paths: Modeling co-occurrences of word pairs and dependency paths for recognizing lexical semantic relations. In NAACL, pages 1123–1133.
  • Washio and Kato (2018b) Koki Washio and Tsuneaki Kato. 2018b. Neural latent relational analysis to capture lexical semantic relations in a vector space. In EMNLP, pages 594–600.
  • Weeds et al. (2014) Julie Weeds, Daoud Clarke, Jeremy Reffin, David J. Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In COLING, pages 2249–2259.
  • Wu and He (2019) Shanchan Wu and Yifan He. 2019. Enriching pre-trained language model with entity information for relation classification. In CIKM, pages 2361–2364.
  • Wu et al. (2012) Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Qili Zhu. 2012. Probase: a probabilistic taxonomy for text understanding. In SIGMOD, pages 481–492.
  • Yang et al. (2017) Shuo Yang, Lei Zou, Zhongyuan Wang, Jun Yan, and Ji-Rong Wen. 2017. Efficiently answering technical questions - A knowledge graph approach. In AAAI, pages 3111–3118.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In NeurIPS, pages 5754–5764.
  • Yao et al. (2019) Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. KG-BERT: BERT for knowledge graph completion. CoRR, abs/1909.03193.
  • Zhang et al. (2019) Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: enhanced language representation with informative entities. In ACL, pages 1441–1451.