
Consistent Representation Learning for Continual Relation Extraction

Continual relation extraction (CRE) aims to continuously train a model on data with new relations while avoiding forgetting old ones. Previous work has shown that storing a few typical samples of old relations and replaying them when learning new relations can effectively avoid forgetting. However, these memory-based methods tend to overfit the memory samples and perform poorly on imbalanced datasets. To address these challenges, a consistent representation learning method is proposed, which maintains the stability of the relation embedding by adopting contrastive learning and knowledge distillation when replaying memory. Specifically, supervised contrastive learning based on a memory bank is first used to train each new task so that the model can effectively learn the relation representation. Then, contrastive replay is conducted on the samples in memory, and memory knowledge distillation makes the model retain the knowledge of historical relations to prevent catastrophic forgetting of old tasks. The proposed method can better learn consistent representations to alleviate forgetting effectively. Extensive experiments on the FewRel and TACRED datasets show that our method significantly outperforms state-of-the-art baselines and yields strong robustness on imbalanced datasets.





Code Repositories


Implementation of the research paper Consistent Representation Learning for Continual Relation Extraction (Findings of ACL 2022)


1 Introduction

*Corresponding author

Relation extraction (RE) is an essential problem in information extraction (IE) and supports many downstream NLP tasks, such as information retrieval DBLP:conf/www/XiongPC17 and question answering DBLP:conf/ijcai/TaoGSWZY18. For example, given a sentence $x$ with an annotated entity pair $e_1$ and $e_2$, RE aims to identify the relation between $e_1$ and $e_2$. However, traditional relation extraction models zhou2016attention; soares2019matching assume a fixed set of predefined relations and train on a fixed dataset, so they cannot handle the growing set of relation types encountered in real life.

To address this situation, continual relation extraction (CRE) was introduced wang2019sentence; han2020continual; wu2021curriculum; cui2021refining. Compared with traditional relation extraction, CRE aims to help the model learn new relations while maintaining accurate classification of old ones. wang2019sentence shows that continual relation learning needs to alleviate the catastrophic forgetting of old tasks when the model learns new tasks. Because a neural network updates a fixed set of parameters every time it is trained, the most effective solution to catastrophic forgetting is to store all the historical data and retrain the model on all of it each time new relation instances appear. This approach achieves the best results in continual relation learning, but it is not adopted in practice because of its time and computing costs.

Recent works have proposed a variety of methods to alleviate catastrophic forgetting in continual learning, including regularization methods kirkpatrick2017overcoming; zenke2017continual; liu2018rotate, dynamic architecture methods chen2015net2net; fernando2017pathnet, and memory-based methods lopez2017gradient; chaudhry2018efficient. Although these methods have been verified on simple image classification tasks, previous works have shown that memory-based methods are the most effective in natural language processing applications wang2019sentence; d2019episodic. In recent years, memory-based continual relation extraction models have made significant progress in alleviating catastrophic forgetting han2020continual; wu2021curriculum; cui2021refining. wang2019sentence proposes a sentence embedding alignment mechanism in memory maintenance to ensure the stability of the embedding space. han2020continual introduces a multi-round joint training process for memory consolidation. However, these two methods only examine catastrophic forgetting through the overall performance on the task sequence. wu2021curriculum proposes to integrate curriculum learning. Although this makes it possible to analyze the characteristics of each subtask and the performance of the corresponding model, it still fails to make full use of the saved sample information. cui2021refining introduces an attention network to refine the prototypes and better recover from the disruption of the embedding space. However, as the classifier keeps being trained on new tasks, this method becomes biased in its classification of old tasks, which hurts their performance. Although the above methods can alleviate catastrophic forgetting to a certain extent, they do not consider the consistency of the relation embedding space.

Because the performance of a CRE model is sensitive to the quality of the sample embeddings, learning new tasks must not damage the embeddings of old tasks. Inspired by the way supervised contrastive learning khosla2020supervised explicitly constrains data embeddings, a consistent representation learning method is proposed for continual relation extraction, which keeps the embeddings of old tasks from changing significantly through supervised contrastive learning and knowledge distillation. Specifically, the example encoder is first trained on the current task data through supervised contrastive learning based on a memory bank, and after training is completed, k-means is used to select representative samples to store as memory. To relieve catastrophic forgetting, contrastive replay is used to train on the memorized samples. At the same time, to ensure that the embeddings of historical relations do not change significantly, knowledge distillation is used to keep the embedding distributions of the new and old tasks consistent. In the testing phase, the nearest class mean (NCM) classifier is used to classify the test samples, so the prediction is not affected by classifier bias.

In summary, our contributions are as follows. First, a novel CRE method is proposed, which uses supervised contrastive learning and knowledge distillation to learn consistent relation representations for continual learning. Second, consistent representation learning ensures the stability of the relation embedding space, which alleviates catastrophic forgetting and makes full use of the stored samples. Finally, extensive experimental results on the FewRel and TACRED datasets show that the proposed method outperforms the latest baselines and effectively mitigates catastrophic forgetting.

2 Related Work

2.1 Continual Learning

Existing continual learning models mainly fall into three categories: (1) Regularization-based methods kirkpatrick2017overcoming; zenke2017continual impose constraints on updates to the neural weights that are important to previous tasks in order to relieve catastrophic forgetting. (2) Dynamic architecture methods chen2015net2net; fernando2017pathnet extend the model architecture dynamically to learn new tasks and effectively prevent forgetting of old tasks. However, these methods are unsuitable for NLP applications because the model size increases dramatically as the number of tasks grows. (3) Memory-based methods lopez2017gradient; aljundi2018memory; chaudhry2018efficient; mai2021supervised save some samples from old tasks and keep learning from them during new tasks to alleviate catastrophic forgetting. dong2021few proposes a simple relational distillation incremental learning framework to balance retaining old knowledge and adapting to new knowledge. yan2021dynamically proposes a new two-stage learning method that uses dynamically expandable representations for more effective incremental concept modelling. Among these methods, memory-based methods are the most effective in NLP tasks wang2019sentence; sun2019lamol; d2019episodic. Inspired by the success of memory-based methods in NLP, we use the memory replay framework to learn the new relations that keep emerging.

2.2 Contrastive Learning

Contrastive learning (CL) aims to map the representations of similar samples closer to each other in the embedding space, while those of dissimilar samples should be farther apart jaiswal2021survey. In recent years, CL has driven great progress in self-supervised representation learning wu2018unsupervised; he2020momentum; li2020prototypical; chen2021exploring. What these works have in common is that no labels are available, so positive and negative pairs are formed through data augmentation. Recently, supervised contrastive learning khosla2020supervised, which uses label information to extend contrastive learning, has received much attention. hendrycks2019benchmarking compares the supervised contrastive loss with the cross-entropy loss on the ImageNet-C dataset and verifies that the supervised contrastive loss is not sensitive to the hyperparameter settings of the optimizer or the data augmentation.

chen2020simple proposes a contrastive learning framework for visual representations that does not require a special architecture or memory bank. khosla2020supervised extends the self-supervised batch contrastive approach to the fully supervised setting and uses a supervised contrastive loss to learn better representations. liu2020hybrid proposes a hybrid discriminative-generative training method based on an energy model. In this paper, contrastive learning is applied to continual relation extraction to extract better relation representations.

3 Methodology

3.1 Problem Formulation

In continual relation extraction, we are given a series of $K$ tasks $\{T_1, T_2, \dots, T_K\}$, where the $k$-th task $T_k$ has its own training set $D_k$ and relation set $R_k$. Each task is a traditional supervised classification task consisting of a series of examples and their corresponding labels $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the input data, including the natural language text and entity pair, and $y_i \in R_k$ is the relation label. The goal of continual relation learning is to train a model that keeps learning new tasks while avoiding catastrophic forgetting of previously learned tasks. In other words, after learning the $k$-th task, the model can classify the relation of a given entity pair into $\tilde{R}_k$, where $\tilde{R}_k = \bigcup_{i=1}^{k} R_i$ is the set of relations observed up to the $k$-th task.

In order to mitigate catastrophic forgetting in continual relation extraction, previous work wang2019sentence; han2020continual; cui2021refining has used episodic memory modules to store a small number of samples from historical tasks. Inspired by cui2021refining, we store several representative samples for each relation. Therefore, the episodic memory module for the observed relations in $\tilde{R}_k$ is $\tilde{M}_k = \bigcup_{r \in \tilde{R}_k} M^r$, where $M^r = \{(x_i, y_i)\}_{i=1}^{O}$, $r$ represents a certain relation, and $O$ is the number of stored samples per relation (memory size).

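Input: the training set $D_k$ of the $k$-th task, encoder $E$, projection head $\mathrm{Proj}$, history memory $\tilde{M}_{k-1}$, current relation set $R_k$, history relation set $\tilde{R}_{k-1}$
Output: encoder $E$, history memory $\tilde{M}_k$, history relation set $\tilde{R}_k$
1:  if $T_k$ is not the first task then
2:     get memory knowledge $A$ with $E$ on $\tilde{M}_{k-1}$;
3:  end if
4:  $M_b \leftarrow \mathrm{Proj}(E(D_k))$;
5:  for $i \leftarrow 1$ to $epoch_1$ do
6:     for $x_B \in D_k$ do
7:        Sample $S_I$ from $M_b$;
8:        Update $E$ and $\mathrm{Proj}$ with $\nabla \mathcal{L}_{CL}$;
9:        Update $M_b$;
10:     end for
11:  end for
12:  Select informative examples from $D_k$ to store into $M_k$;
13:  $\tilde{R}_k \leftarrow \tilde{R}_{k-1} \cup R_k$;
14:  $\tilde{M}_k \leftarrow \tilde{M}_{k-1} \cup M_k$;
15:  if $T_k$ is not the first task then
16:     $M_b \leftarrow \mathrm{Proj}(E(\tilde{M}_k))$;
17:     for $i \leftarrow 1$ to $epoch_2$ do
18:        for $x_B \in \tilde{M}_k$ do
19:           Sample $\tilde{S}_I$ from $M_b$;
20:           Update $E$ and $\mathrm{Proj}$ with $\nabla \mathcal{L}_{CR}$ and $\nabla \mathcal{L}_{KL}$;
21:           Update $M_b$;
22:        end for
23:     end for
24:     Select informative examples to store into the memory;
26:  end if
27:  return $E$, $\tilde{M}_k$, $\tilde{R}_k$;
Algorithm 1: Training procedure for $T_k$
Figure 1: Framework of consistent representation learning.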

3.2 Framework

The consistent representation learning (CRL) procedure for the current task is described in Algorithm 1 and consists of three main steps: (1) Initial training for the new task (lines 4 to 11): the parameters of the encoder and the projection head are trained on the training samples in $D_k$ with supervised contrastive learning. (2) Sample selection (lines 12 to 14): for each relation $r \in R_k$, we retrieve all samples labeled $r$ from $D_k$. Then, the k-means algorithm is used to cluster the samples. For each cluster, the relation representation of the sample closest to the center is selected and stored in memory. (3) Consistent representation learning (lines 15 to 26): in order to keep the embeddings of historical relations consistent after learning new tasks, we apply contrastive replay and knowledge distillation constraints to the samples in memory.

3.3 Encoder

The key to CRE is obtaining a good relation representation. The pre-trained language model BERT DBLP:conf/naacl/DevlinCLT19 has shown a powerful ability to extract contextual representations of text. Therefore, BERT is used to encode the entity pair and its context to obtain the relation representation.

Given a sentence $x$ and a pair of entities $(E_1, E_2)$, we follow DBLP:conf/acl/SoaresFLK19 and augment $x$ with four reserved word pieces, $[E_1]$, $[/E_1]$, $[E_2]$ and $[/E_2]$, to mark the beginning and end of each entity mention in the sentence. The new token sequence is fed into BERT instead of $x$. To get the final relation representation between the two entities, the outputs corresponding to the positions of $[E_1]$ and $[E_2]$ are concatenated and then mapped to a high-dimensional hidden representation $h$, as follows:

$$h = W\,[h_{[E_1]}; h_{[E_2]}] + b,$$

where $h_{[E_1]}$ and $h_{[E_2]}$ are the BERT outputs at the two marker positions, and $W$ and $b$ are trainable parameters. The encoder that maps a marked sentence to this relation representation is denoted as $E$.
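The marker-based encoding described above can be illustrated with a minimal sketch. It assumes entity spans are given as token index pairs with exclusive ends, uses placeholder marker strings, and replaces BERT with a plain list of per-token vectors; the helper names `add_entity_markers` and `relation_representation` are hypothetical and are not taken from the authors' code.

```python
def add_entity_markers(tokens, head_span, tail_span):
    """Wrap two non-overlapping entity spans with reserved marker tokens."""
    (hs, he), (ts, te) = head_span, tail_span
    opening = {hs: "[E1]", ts: "[E2]"}
    closing = {he: "[/E1]", te: "[/E2]"}
    marked = []
    for i in range(len(tokens) + 1):
        if i in closing:          # close a span before opening the next one
            marked.append(closing[i])
        if i in opening:
            marked.append(opening[i])
        if i < len(tokens):
            marked.append(tokens[i])
    return marked


def relation_representation(hidden_states, pos_e1, pos_e2, W, b):
    """Concatenate the vectors at the two start markers and apply h = W v + b."""
    concat = hidden_states[pos_e1] + hidden_states[pos_e2]  # list concatenation
    return [sum(w * v for w, v in zip(row, concat)) + bias
            for row, bias in zip(W, b)]


tokens = ["Bill", "Gates", "founded", "Microsoft"]
marked = add_entity_markers(tokens, (0, 2), (3, 4))
pos_e1, pos_e2 = marked.index("[E1]"), marked.index("[E2]")

# Stand-in for the contextual encoder output: one 2-d vector per token.
hidden_states = [[float(i), float(i) + 0.5] for i in range(len(marked))]
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # toy weights, not trained
b = [0.5, -0.5]
h = relation_representation(hidden_states, pos_e1, pos_e2, W, b)
```

In a real setup the per-token vectors would come from BERT and `W`, `b` would be learned; the sketch only shows where the two marker positions enter the computation.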

Then, we use a projection head $\mathrm{Proj}$ to obtain the low-dimensional embedding:

$$z = \mathrm{Proj}(E(x)),$$

where $\mathrm{Proj}(\cdot)$ is composed of two neural network layers. The normalized embedding $\tilde{z} = z / \lVert z \rVert$ is used for contrastive learning, and the hidden representation $h = E(x)$ is used for classification.
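As a rough illustration of the projection step, the sketch below applies a two-layer projection followed by L2 normalization. The ReLU between the two layers and the helper names are assumptions made for the example; the text only states that the projection head has two layers.

```python
import math


def linear(W, b, v):
    """Dense layer: returns W v + b for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def project(h, W1, b1, W2, b2):
    """Two-layer projection head; the ReLU in between is an assumption."""
    hidden = [max(0.0, x) for x in linear(W1, b1, h)]
    return linear(W2, b2, hidden)


def l2_normalize(z, eps=1e-12):
    """Scale z to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in z))
    return [x / max(norm, eps) for x in z]


identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
z = project([3.0, 4.0], identity, zero, identity, zero)
z_norm = l2_normalize(z)
```

Because the normalized embeddings have unit length, the dot products used in the contrastive losses are bounded between -1 and 1 before the temperature is applied.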

3.4 Initial training for new task

Before training on each new task $T_k$, we first use the encoder to extract the normalized embedding $\tilde{z}_i$ of the relation representation of each sentence in $D_k$, and use these embeddings as the initial memory bank $M_b$:

$$M_b \leftarrow \{\tilde{z}_i\}_{i=1}^{|D_k|}.$$
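At the beginning of training, relation representations are extracted for each batch $B$. The data embeddings are then explicitly constrained to cluster by class through supervised contrastive learning khosla2020supervised:

$$\mathcal{L}_{CL} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\tilde{z}_i \cdot \tilde{z}_p / \tau)}{\sum_{j \in S_I} \exp(\tilde{z}_i \cdot \tilde{z}_j / \tau)},$$

where $I$ is the set of indices of the batch $B$, $S_I$ is the index set of a randomly sampled subset of $M_b$, $P(i) = \{p \in S_I : y_p = y_i\}$ is the set of indices in $M_b$ whose label is the same as that of $\tilde{z}_i$, and $|P(i)|$ is its cardinality. $\tau$ is an adjustable temperature parameter controlling the separation of classes, and $\cdot$ denotes the dot product.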
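To make the loss concrete, here is a minimal sketch of a supervised contrastive loss computed for a batch against candidates drawn from a memory bank. It assumes the embeddings are already normalized, that `bank_z` and `bank_y` hold the sampled subset and its labels, and that anchors without any positive are skipped; the function name and these choices are illustrative rather than the authors' implementation.

```python
import math


def supcon_loss(batch_z, batch_y, bank_z, bank_y, tau=0.1):
    """Sum over anchors of the mean negative log-probability of their positives."""
    total = 0.0
    for z_i, y_i in zip(batch_z, batch_y):
        # Similarity of the anchor to every candidate, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(z_i, z_j)) / tau for z_j in bank_z]
        # Stable log of the softmax denominator (log-sum-exp trick).
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        positives = [l for l, y_j in zip(logits, bank_y) if y_j == y_i]
        if not positives:
            continue  # no same-label candidate was sampled for this anchor
        total += -sum(l - log_denom for l in positives) / len(positives)
    return total


bank_z = [[1.0, 0.0], [0.0, 1.0]]
# Anchor matches the candidate it is most similar to: low loss.
aligned = supcon_loss([[1.0, 0.0]], [0], bank_z, [0, 1], tau=1.0)
# Anchor's only positive is the orthogonal candidate: higher loss.
misaligned = supcon_loss([[1.0, 0.0]], [0], bank_z, [1, 0], tau=1.0)
```

The loss shrinks when same-label candidates receive the largest similarities, which is the clustering pressure described in the text.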

After backpropagating the gradient of the loss on each batch, we update the representations in the memory bank:

$$M_b[\tilde{I}] \leftarrow \{\tilde{z}_i\}_{i \in I},$$

where $\tilde{I}$ is the index set in $M_b$ corresponding to this batch of samples. After training on the training set, the model has learned a better relation representation.

3.5 Selecting Typical Samples for Memory

So that the model does not forget the relevant knowledge of old tasks when it learns a new task, some samples need to be stored in the memory $\tilde{M}_k$. Inspired by han2020continual; cui2021refining, we use k-means to cluster the samples of each relation, where the number of clusters is the number of samples that need to be stored for each class. Then, for each cluster, the relation representation closest to the center is selected and stored in memory.
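The selection step can be sketched as follows for a single relation. This is a small, dependency-free version of k-means with a deterministic farthest-first initialization, which is an assumption chosen for reproducibility; the paper does not specify the initialization, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd iterations with farthest-first initial centers."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


def select_memory(reps, memory_size):
    """Return, for each cluster, the index of the sample closest to its center."""
    centers = kmeans(reps, memory_size)
    return [min(range(len(reps)), key=lambda i: sq_dist(reps[i], c))
            for c in centers]


# Relation representations of one relation: two visible groups.
reps = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
chosen = select_memory(reps, memory_size=2)
```

With a memory size of 2, one representative is kept from each group. Note that two centers could in principle select the same sample; a production version would need to handle that case.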

3.6 Consistent Representation Learning

After learning a new task, the representation of the old relation in the space may change. In order to make the encoder not change the knowledge of the old task while learning the new task, we propose two replay strategies to learn consistent representation for alleviating this problem: contrastive replay and knowledge distillation. Figure 1 shows the main flow of consistent representation learning.

Contrastive Replay with Memory Bank

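After learning the current task, we further train the encoder by replaying the samples stored in the memory $\tilde{M}_k$, using the same method as in Section 3.4.

The difference here is that each batch uses all the samples in the entire memory bank for contrastive learning, as follows:

$$\mathcal{L}_{CR} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\tilde{z}_i \cdot \tilde{z}_p / \tau)}{\sum_{j \in \tilde{S}_I} \exp(\tilde{z}_i \cdot \tilde{z}_j / \tau)},$$

where $I$ is the set of indices of the batch, $\tilde{S}_I$ represents the set of indices of all samples in $M_b$, $P(i)$ is the set of indices in $\tilde{S}_I$ that share the label of sample $i$, $\tilde{z}$ denotes a normalized embedding, and $\tau$ is the temperature. $M_b$ is the memory bank, which stores the normalized representations of all samples in $\tilde{M}_k$.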

By replaying the samples in memory, the encoder can alleviate the forgetting of previously learned knowledge and, at the same time, consolidate the knowledge learned in the current task. However, contrastive replay trains the encoder on only a small number of samples, which risks overfitting. In addition, it may change the distribution of relations in the previous tasks. Therefore, we propose knowledge distillation to make up for this shortcoming.

Knowledge Distillation for Relieving Forgetting

We hope that the model can retain the semantic knowledge between relations in historical tasks. Therefore, before the encoder is trained on a task, we use the similarity metric between the relations in memory as Memory Knowledge. We then use knowledge distillation to keep the model from forgetting this knowledge.

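Specifically, the samples in the memory are encoded first, and then the prototype of each class is calculated:

$$\mathbf{p}_c = \frac{1}{O} \sum_{i=1}^{O} \tilde{z}_i^{c},$$

where $O$ is the memory size and $\tilde{z}_i^{c}$ is the relation representation of the $i$-th stored sample belonging to class $c$. Then, the cosine similarity between the classes is calculated to represent the knowledge learned in the memory:

$$a_{ij} = \frac{\mathbf{p}_i^{\top} \mathbf{p}_j}{\lVert \mathbf{p}_i \rVert \, \lVert \mathbf{p}_j \rVert},$$

where $a_{ij}$ is the cosine similarity between prototypes $\mathbf{p}_i$ and $\mathbf{p}_j$.

When performing memory replay, we use the KL divergence to make the encoder retain the knowledge of the old tasks:

$$\mathcal{L}_{KL} = \sum_{i} \mathrm{KL}(P_i \,\|\, Q_i),$$

where $P_i = \{p_{ij}\}_j$ is the metric distribution of the prototypes before training, and $p_{ij} = \frac{\exp(a_{ij}/\tau)}{\sum_{j'} \exp(a_{ij'}/\tau)}$. Similarly, $Q_i = \{q_{ij}\}_j$ is the metric distribution of the temporary prototypes calculated from the memory bank during training, and $q_{ij} = \frac{\exp(\tilde{a}_{ij}/\tau)}{\sum_{j'} \exp(\tilde{a}_{ij'}/\tau)}$. $\tilde{a}_{ij}$ is the Embedding Knowledge of the memory, that is, the cosine similarity between temporary prototypes. The temporary prototypes are dynamically calculated in each batch from the memory bank $M_b$.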
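The distillation term can be sketched as below: class prototypes are averaged from stored representations, pairwise cosine similarities are turned into one distribution per class, and the KL divergence compares the distributions before and during replay. Normalizing the similarities with a temperature-scaled softmax and the helper names are assumptions made for this sketch.

```python
import math


def prototypes(reps_by_class):
    """Mean representation per class; input maps class -> list of vectors."""
    return {c: [sum(dim) / len(vs) for dim in zip(*vs)]
            for c, vs in reps_by_class.items()}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_distributions(protos, tau=1.0):
    """One softmax distribution over classes per prototype (assumed normalization)."""
    classes = sorted(protos)
    dists = []
    for ci in classes:
        sims = [cosine(protos[ci], protos[cj]) / tau for cj in classes]
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]
        dists.append([e / sum(exps) for e in exps])
    return dists


def kl_distillation_loss(old_dists, new_dists):
    """Sum over classes of KL(P_i || Q_i)."""
    return sum(p * math.log(p / q)
               for P, Q in zip(old_dists, new_dists)
               for p, q in zip(P, Q))


# Memory knowledge captured before training on the new task.
before = similarity_distributions(prototypes({"a": [[1.0, 0.0], [1.0, 0.0]],
                                              "b": [[0.0, 1.0]]}))
# Temporary prototypes during replay: class "b" has drifted toward class "a".
during = similarity_distributions(prototypes({"a": [[1.0, 0.0]],
                                              "b": [[1.0, 1.0]]}))
unchanged = kl_distillation_loss(before, before)
drifted = kl_distillation_loss(before, during)
```

The loss is zero when the relative geometry of the prototypes is unchanged and grows as classes drift toward or away from each other, which is the behaviour the text relies on to keep old relations stable.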

(a) FewRel
Model          T1    T2    T3    T4    T5    T6    T7    T8    T9    T10
EA-EMR         89.0  69.0  59.1  54.2  47.8  46.1  43.1  40.7  38.6  35.2
EMAR           88.5  73.2  66.6  63.8  55.8  54.3  52.9  50.9  48.8  46.3
CML            91.2  74.8  68.2  58.2  53.7  50.4  47.8  44.4  43.1  39.7
EMAR+BERT      98.8  89.1  89.5  85.7  83.6  84.8  79.3  80.0  77.1  73.8
RP-CRE         97.9  92.7  91.6  89.2  88.4  86.8  85.1  84.1  82.2  81.5
RP-CRE†        98.4  95.2  93.1  91.4  90.8  88.8  87.6  86.8  85.2  83.9
CRL            98.3  95.4  93.4  92.0  91.0  89.7  88.3  87.0  85.6  84.4
CRL (w/o KL)   98.3  95.2  93.1  91.5  90.4  89.0  87.7  86.3  84.9  83.4
CRL (w/o CR)   98.3  94.8  92.2  90.7  89.4  87.6  86.5  85.0  83.7  82.0

(b) TACRED
Model          T1    T2    T3    T4    T5    T6    T7    T8    T9    T10
EA-EMR         47.5  40.1  38.3  29.9  24.0  27.3  26.9  25.8  22.9  19.8
EMAR           73.6  57.0  48.3  42.3  37.7  34.0  32.6  30.0  27.6  25.1
CML            57.2  51.4  41.3  39.3  35.9  28.9  27.3  26.9  24.8  23.4
EMAR+BERT      96.6  85.7  81.0  78.6  73.9  72.3  71.7  72.2  72.6  71.0
RP-CRE         97.6  90.6  86.1  82.4  79.8  77.2  75.1  73.7  72.4  72.4
RP-CRE†        97.8  92.3  91.0  87.3  84.2  82.7  79.8  78.8  78.6  77.3
CRL            98.1  94.7  91.6  87.0  86.3  84.5  82.9  81.8  81.8  80.7
CRL (w/o KL)   98.1  94.2  91.7  87.1  86.6  84.4  82.2  81.5  81.0  80.1
CRL (w/o CR)   98.1  93.2  90.1  85.8  83.2  81.2  79.4  77.4  76.8  75.9

Table 1: Accuracy (%) on all observed relations (which continue to accumulate over time) at the stage of learning the current task, on (a) FewRel and (b) TACRED. The method marked with † reports results generated from the open source code; the other baseline results are copied from the original paper cui2021refining.

3.7 NCM for Prediction

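To predict a label for a test sample $x$, the nearest class mean (NCM) classifier mai2021supervised compares the embedding of $x$ with all the prototypes of the memory and assigns the class label of the most similar prototype:

$$\mathbf{p}_r = \frac{1}{O} \sum_{i=1}^{O} E(\tilde{x}_i^{r}), \qquad y^{*} = \arg\min_{r \in \tilde{R}_k} \lVert E(x) - \mathbf{p}_r \rVert,$$

where $\tilde{x}_i^{r}$ is a stored sample of relation $r$, $O$ is the memory size, and $y^{*}$ is the predicted label. Since the NCM classifier compares the embedding of the test sample with prototypes, it does not require an additional FC layer. Therefore, new classes can be added without any architecture modification.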
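A nearest class mean prediction can be sketched in a few lines. The example assumes Euclidean distance to the class prototypes and uses toy 2-d vectors in place of encoder outputs; the relation names and function names are purely illustrative.

```python
def class_prototypes(memory):
    """Average the stored representations of each relation."""
    return {label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
            for label, vectors in memory.items()}


def ncm_predict(x, protos):
    """Return the label whose prototype is closest to x (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(protos, key=lambda label: sq_dist(x, protos[label]))


# Toy memory: two relations with two stored representations each.
memory = {
    "founder": [[1.0, 0.0], [0.8, 0.2]],
    "birthplace": [[0.0, 1.0], [0.2, 0.8]],
}
protos = class_prototypes(memory)
predicted = ncm_predict([0.7, 0.3], protos)

# Adding a new relation only requires a new prototype, no new output layer.
memory["spouse"] = [[-1.0, -1.0]]
predicted_new = ncm_predict([-0.9, -0.8], class_prototypes(memory))
```

The second call shows the property noted above: a new class is supported by adding its prototype, without changing the model architecture.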

4 Experiments

4.1 Datasets

Our experiments are conducted on two benchmark datasets. In the experiments, the training/test/validation split ratio is 3:1:1.

FewRel han-etal-2018-fewrel is an RE dataset that contains 80 relations, each with 700 instances. Following the experimental settings of wang2019sentence, the original train and validation sets of FewRel are used in our experiments, which together contain 80 classes.

TACRED zhang2017position is a large-scale RE dataset containing 42 relations (including no_relation) and 106,264 samples, built on newswire and web documents. Compared with FewRel, the samples in TACRED are imbalanced. Following cui2021refining, the number of training samples for each relation is limited to 320 and the number of test samples for each relation to 40.

4.2 Evaluation Metrics

Average accuracy is a better measure of the effect of catastrophic forgetting because it emphasizes the model’s performance on earlier tasks han2020continual; cui2021refining. This paper evaluates the model by using the average accuracy of tasks at each step.

4.3 Baselines

We evaluate CRL and several baselines on benchmarks for comparison:

(1) EA-EMR wang2019sentence introduced a memory replay and embedding alignment mechanism to maintain memory and alleviate embedding distortion during training for new tasks.

(2) EMAR han2020continual constructs a memory activation and reconsolidation mechanism to alleviate the catastrophic forgetting problem in CRE.

(3) CML wu2021curriculum proposed a curriculum-meta learning method to alleviate the order sensitivity and catastrophic forgetting in CRE.

(4) RP-CRE cui2021refining achieves enhanced performance by utilizing relation prototypes to refine sample embeddings, thereby effectively avoiding catastrophic forgetting.

(a) Results on FewRel.
(b) Results on TACRED.
Figure 2: Comparison of the models' dependence on memory size, showing that our model depends only lightly on memory size. The X-axis is the serial ID of the current task, and the Y-axis is the accuracy of the model on the test set of all relations observed at the current stage.

4.4 Training Details and Parameters Setting

A completely random sampling strategy at the relation level is adopted: all relations of the dataset are randomly divided into 10 sets to simulate 10 tasks, as suggested in cui2021refining. For a fair comparison, we set the random seed of the experiment to be the same as the seed in cui2021refining, so that the task sequence is exactly the same. Note that our reproduced RP-CRE model and CRL use strictly the same experimental environment. To facilitate the reproduction of our experimental results, the source code of the proposed method and the detailed hyperparameters are provided on GitHub.
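The relation-level task split can be sketched as below. The function name, the seed value and the round-robin assignment are assumptions for illustration; the actual seed is the one used in cui2021refining and is not stated here.

```python
import random


def split_into_tasks(relations, num_tasks=10, seed=0):
    """Shuffle relations with a fixed seed and deal them into num_tasks sets."""
    rng = random.Random(seed)   # local generator, so global state is untouched
    shuffled = list(relations)
    rng.shuffle(shuffled)
    # Round-robin slicing keeps task sizes within one relation of each other.
    return [shuffled[i::num_tasks] for i in range(num_tasks)]


relations = [f"relation_{i}" for i in range(80)]  # e.g. the 80 FewRel relations
tasks = split_into_tasks(relations, num_tasks=10, seed=0)
same_again = split_into_tasks(relations, num_tasks=10, seed=0)
```

Fixing the seed makes the task sequence reproducible, which is what allows two methods to be compared on exactly the same order of relations.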


4.5 Results and Discussion

Table 1 compares the results of the proposed method and the baselines on the two datasets, where RP-CRE† is reproduced under the same conditions based on the open source code. We also ablate knowledge distillation and contrastive replay in consistent representation learning: CRL (w/o KL) and CRL (w/o CR) refer to removing the knowledge distillation loss and the contrastive replay loss, respectively, when replaying memory. From the table, some conclusions can be drawn:

(1) Our proposed CRL is significantly better than the other baselines and achieves state-of-the-art performance in the vast majority of settings. Compared with RP-CRE, our model also shows a clear advantage. This suggests that CRL learns more consistent relation representations and is more stable during continual learning.

(2) All baselines perform worse on the TACRED dataset. The primary reason is that TACRED is an imbalanced dataset. However, on the last task our model outperforms RP-CRE by 3.4% on TACRED, a larger margin than the 0.5% improvement on the class-balanced FewRel dataset. This shows that our model is more robust in class-imbalanced scenarios.

(3) Comparing CRL and CRL (w/o KL), not adopting knowledge distillation during training causes the model to drop 1% and 0.6% on FewRel and TACRED, respectively. The experimental results show that knowledge distillation consistently alleviates the model's forgetting of previous knowledge, so that it learns a better consistent representation.

(4) Comparing CRL and CRL (w/o CR), removing the contrastive replay loss $\mathcal{L}_{CR}$ during memory replay causes the model to drop 2.4% and 4.8% on FewRel and TACRED, respectively. The reason for the significant drop is that adopting only the knowledge distillation loss $\mathcal{L}_{KL}$ does not let the model review the samples of the current task, which leads to overfitting on the historical relations during replay.
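The gaps quoted in points (2) to (4) can be checked directly against the final-task (T10) column of Table 1. The short sketch below reads the first block of the table as FewRel and the second as TACRED, which is the reading consistent with the quoted numbers.

```python
# Final-task (T10) accuracies copied from Table 1.
final_acc = {
    "FewRel": {"RP-CRE†": 83.9, "CRL": 84.4, "w/o KL": 83.4, "w/o CR": 82.0},
    "TACRED": {"RP-CRE†": 77.3, "CRL": 80.7, "w/o KL": 80.1, "w/o CR": 75.9},
}


def gap(dataset, better, worse):
    """Accuracy difference in percentage points, rounded to one decimal."""
    scores = final_acc[dataset]
    return round(scores[better] - scores[worse], 1)


gain_over_rp_cre = {d: gap(d, "CRL", "RP-CRE†") for d in final_acc}
drop_without_kl = {d: gap(d, "CRL", "w/o KL") for d in final_acc}
drop_without_cr = {d: gap(d, "CRL", "w/o CR") for d in final_acc}
```

The differences are absolute percentage points on the last task, not relative changes.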

Figure 3: A visualization of the relation representations learned by RP-CRE and CRL on the task 1 test set at different tasks.

4.6 Effect of Memory Size

The memory size is the number of memory samples needed for each relation. In this section, we will study the impact of memory size on the performance of our model and RP-CRE. We compare three memory sizes: 5, 10, and 20. The experimental results are shown in Figure 2.

We choose RP-CRE as the main competitor, with all configurations and the task sequence unchanged. (1) As the memory size decreases, the performance of the models tends to decline, which shows that memory size is a key factor affecting continual learning. However, our model is more stable than RP-CRE (judging by the performance gap on the final task), especially on the TACRED dataset. (2) On both FewRel and TACRED, CRL keeps the best performance under different memory sizes and shows a clear advantage with small memory. This indicates that consistent representation learning uses memory more effectively than the existing memory-based CRE method.

4.7 Effect of Consistent Representation Learning

In order to explore the long-term effects of consistent representation learning in continual relation extraction, we tested our model and RP-CRE on TACRED to observe how the embedding space of old tasks changes as new tasks are added. The model performs feature extraction on all samples in the test set of task 1 at the end of tasks 1, 4, 7, and 10. Then t-SNE is used to reduce the dimensionality of the relation representations. All samples in the test set of task 1 are drawn, where points of different colors represent different ground-truth labels. The visualization results are shown in Figure 3.

From Figure 3, we can see that although the relation embeddings of RP-CRE are clustered and separated by class after prototype refinement, the data embeddings of task 1 become visibly scattered as new tasks are continuously learned. In contrast, our model retains a good separation between classes, while the data embeddings within each class are compact and keep a certain diversity. In addition, the distribution of the different classes in task 1 changes relatively little under our model, which retains the knowledge of historical tasks as training proceeds. This is mainly because our model learns through supervised contrastive learning, which explicitly encourages the samples in historical memory to be compact within each class and far apart between classes. The knowledge of historical memory is further preserved through the distillation of memory knowledge. Because knowledge distillation preserves the distance distribution between classes, it compensates for the tendency of contrastive learning to over-optimize the distance between classes and thus helps prevent overfitting.

5 Conclusions and Future Work

This paper proposes a novel consistent representation learning method for the CRE task, based mainly on contrastive learning and knowledge distillation when replaying memory. Specifically, we use supervised contrastive learning based on a memory bank to train each new task so that the model can effectively learn the feature representation. In addition, to prevent catastrophic forgetting of old tasks, we perform contrastive replay on the memory samples and, at the same time, use knowledge distillation to make the model retain the knowledge of the relations between historical tasks. Our method can better learn consistent representations to alleviate catastrophic forgetting effectively. Extensive experiments on two benchmark datasets show that our method significantly improves on the state of the art and demonstrates powerful representation learning capabilities. In the future, we will continue to study cross-domain continual relation extraction to acquire ever-increasing knowledge.


This paper is funded by National Natural Science Foundation of China (Grant No. 62173195) and Natural Science Foundation of Hebei Province, China (pre-research No. F2022208006). We would like to thank the anonymous reviewers for their valuable feedback.