Applying deep learning methods to graph data analytics tasks has recently generated significant research interest. Many models have been developed to tackle node classification, link prediction, and graph classification. Earlier efforts [20, 9] mainly focused on representing the nodes of networks/graphs (whenever there is no ambiguity, the terms networks and graphs are used interchangeably throughout this paper) in a low-dimensional vector space, while preserving both the topological structure and the node attribute information in an unsupervised manner. More recently, researchers have shifted from developing sophisticated deep learning models on Euclidean-like domains (e.g., image, text) to non-Euclidean data (e.g., graph-structured data). This, in turn, has resulted in many notable Graph Neural Network (GNN) models, e.g., GCN [12], GraphSAGE [10], and SGC [24].
Despite the significant breakthroughs achieved by GNNs, existing models, in both static and dynamic graph settings, primarily focus on a single task. Learning multiple tasks in sequence remains an important challenge for GNN models. A natural question is how these popular GNN models perform when learning a series of graph-related tasks, a setting we term Continual Graph Learning (CGL) in this paper. Taking Reddit post data as an illustrative scenario (cf. Figure 1), there can be several sub-communities, and we expect to build a common node classifier across all of them, rather than one for the entire graph. To that end, we train the classifier on each sub-community in sequence, and the learning process on each sub-community is considered a task. However, this kind of training process can easily lead to the phenomenon known as catastrophic forgetting (cf. the bottom of Figure 1), where the classifier's parameters are overwritten when a new task is learned, which is likely to cause a significant drop in performance on previous tasks. Note that the classes/labels in one task are different from those in other tasks; this learning process is therefore often referred to as task-incremental learning [4].
Continual learning, also referred to as lifelong learning, sequential learning, or incremental learning, has recently drawn significant research attention. Its objective is to gradually extend acquired knowledge and use it for future learning, similarly to human intelligence [2]. Continual learning focuses on learning multiple tasks sequentially, targeting two general goals: (i) learning a new task without catastrophically forgetting former tasks [19, 7], and (ii) leveraging knowledge from prior tasks to facilitate the learning of new tasks. Catastrophic forgetting is a direct outcome of a more general problem in neural networks, the so-called “stability-plasticity” dilemma [8]. While stability indicates the preservation of previously acquired knowledge, plasticity refers to the ability to integrate new knowledge. This stability-plasticity trade-off is an essential aspect of both artificial and biological neural systems. Existing studies in continual learning mainly address image classification and reinforcement learning, and have yielded several successful methods, e.g., iCaRL [21], GEM [16], EWC [13], SI [28], LwF [15], and PackNet [17]. However, despite the extensive studies and promising results, there are surprisingly few works on CGL. The two major reasons are: (i) graph data are non-Euclidean and not independent and identically distributed, and (ii) graphs can be irregular and noisy, and exhibit more complex relations among nodes.
To bridge this gap, in this work we target the continual learning setting for graph data and formulate a continual node classification problem. We also conduct an empirical investigation of catastrophic forgetting in GNNs. To our knowledge, we are among the first to analyze graph data in such a sequential learning setting. We present a novel and general Experience Replay GNN framework (ER-GNN), which stores a set of nodes as experiences in a buffer and replays them in subsequent tasks, providing the capability of learning multiple consecutive tasks while alleviating catastrophic forgetting. For experience selection, besides two intuitive strategies, we propose a novel scheme built upon the influence function [11], which performs favorably in our evaluation. In summary, we make the following contributions:
- We present the continual graph learning (CGL) paradigm and formulate a new continual learning problem for node classification. The main difference from previous work is that we aim to learn multiple consecutive tasks rather than a single task.
- We conduct an empirical investigation of the continual node classification task, demonstrating that existing GNNs suffer from catastrophic forgetting when learning a stream of tasks in succession.
- To address the catastrophic forgetting issue, we develop a generic experience replay based framework that can be easily combined with any popular GNN model. Apart from two intuitive experience selection schemes, we propose a novel strategy based on the influence function.
- We conduct extensive experimental evaluations on three datasets to demonstrate the superiority of our framework over several state-of-the-art GNNs.
2 Related Work
We now review the related literature grouped into two main categories and position our work in that context by indicating the respective issues that are addressed by our contribution.
2.1 Graph Neural Networks
Recently, a wide range of graph neural networks (GNNs) have been proposed to exploit the structural information underlying graphs, to potentially benefit a variety of applications [12, 6, 24, 29]. Most of the existing GNNs can be categorized into two groups: non-spectral and spectral methods. Non-spectral methods mainly develop an aggregator to gather a set of local features [10, 23]. For spectral methods, the basic idea is to learn graph representations in the spectral domain, where the learned filters are based on Laplacian matrices [5, 12]. These methods have achieved great success in several graph-based tasks (e.g., node classification). However, all of them focus on single-task learning. In this work, we study how to learn a sequence of tasks, where each task is a node classification problem.
2.2 Continual Learning
Several approaches have been proposed to tackle catastrophic forgetting over the last few years. We can roughly distinguish three lines of work: (i) experience replay based methods; (ii) regularization-based methods; (iii) parameter isolation based methods. The first line of work stores samples in their raw format or compressed in a generative model. The stored samples from previous tasks are replayed while learning new tasks to alleviate forgetting. These samples/pseudo-samples can be used either for rehearsal, approximating the joint training of previous and current tasks, or to constrain the optimization [16, 21]. The second line of work adds an extra regularization term to the loss function to consolidate previous knowledge when learning on new data [13, 15, 28]. The last line of work attempts to prevent any possible forgetting of previous tasks by dedicating different parameter subsets to different tasks. When there is no constraint on the size of the architecture, this can be done by freezing the set of parameters learned after each previous task and growing new branches for new tasks. Alternatively, under a fixed architecture, methods proceed by identifying the parts that are used for the previous tasks and masking them out during the training of the new task [17, 26]. These methods have achieved great success in image classification and reinforcement learning. However, they have not been investigated on graph data, which motivates our study in this paper. Our method belongs to the family of experience replay based methods. Furthermore, we propose a new experience selection strategy based on the influence function [11], in addition to two intuitive experience selection schemes.
This section describes the details of our proposed general framework ER-GNN for continual node classification. We begin with the formal definition of our problem in Section 3.1, followed by the details of our ER-GNN in Section 3.2 where three experience selection strategies are presented.
3.1 Problem Definition
The Continual Node Classification (i.e., task-incremental learning) setting assumes the existence of a collection of tasks $\mathcal{T} = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_i, \ldots, \mathcal{T}_M\}$, which are encountered sequentially. Each $\mathcal{T}_i \in \mathcal{T}$ is a node classification task, with the inherent constraint of continual learning that once the learning of a task is completed, the data from this task is no longer available. Formally, the node classification task is defined as:
Definition 1 (Node Classification).
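For each task $\mathcal{T}_i$, we have a training node set $\mathcal{D}_i^{tr}$ and a testing node set $\mathcal{D}_i^{te}$. Node classification aims at learning a task-specific classifier on $\mathcal{D}_i^{tr}$ that is expected to classify each node in $\mathcal{D}_i^{te}$ into the correct class $y^l \in \mathcal{Y}$, where $\mathcal{Y} = \{y^1, y^2, \ldots, y^L\}$ is the label set and $L$ is the number of classes in this task.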
In our continual learning setting, instead of focusing on a single task $\mathcal{T}_i$, we aim to learn a series of node classification tasks $\mathcal{T}$. Therefore, our goal is to learn a model $f_\theta$, parameterized by $\theta$, that can learn these tasks successively. In particular, we expect our classifier $f_\theta$ not only to perform well on the current task but also to overcome catastrophic forgetting with respect to the previous tasks.
We are inspired by a well-supported model of biological learning in human beings [18], which suggests that neocortical neurons learn using an algorithm that is prone to catastrophic forgetting, and that this neocortical learning algorithm is complemented by a virtual experience system that replays memories stored in the hippocampus in order to continually reinforce tasks that have not been recently performed. We thus propose a novel and general framework, dubbed ER-GNN, which selects and preserves experience nodes from the current task and replays them in future tasks. The framework of our ER-GNN is outlined in Algorithm 1.
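When learning a task $\mathcal{T}_i$, we acquire its training set $\mathcal{D}_i^{tr}$ and testing set $\mathcal{D}_i^{te}$. Subsequently, we extract examples from the experience buffer, denoted as $B$. Then we feed both the training set $\mathcal{D}_i^{tr}$ and the experience nodes $B$ to our classifier $f_\theta$. A natural loss function choice for the node classification task is the cross-entropy loss:

$$\mathcal{L}_{\mathcal{T}_i}(f_\theta, \mathcal{D}_i^{tr}) = -\sum_{(v_j, y_j) \in \mathcal{D}_i^{tr}} \Big( y_j \log f_\theta(v_j) + (1 - y_j) \log \big(1 - f_\theta(v_j)\big) \Big) \tag{1}$$

Applying the same loss to the experience nodes and combining the two terms gives the training objective:

$$\mathcal{L}'_{\mathcal{T}_i}(f_\theta, \mathcal{D}_i^{tr}, B) = \beta\, \mathcal{L}_{\mathcal{T}_i}(f_\theta, \mathcal{D}_i^{tr}) + (1 - \beta)\, \mathcal{L}_{\mathcal{T}_i}(f_\theta, B) \tag{2}$$

where $\beta$ is a weight factor that can be learned or predefined. Since the number of nodes in the training set is larger than the size of the experience buffer, we need this weight factor to balance the influence of $\mathcal{D}_i^{tr}$ and $B$, preventing the model from favoring one node set. In our experiments, we observe that it is beneficial to use the sample proportion as a dynamic weight factor, i.e., $\beta = |B| / (|\mathcal{D}_i^{tr}| + |B|)$, where $|\mathcal{D}_i^{tr}|$ and $|B|$ are the number of nodes in the training set of $\mathcal{T}_i$ and the size of the current experience buffer, respectively.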
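As a rough illustration of how the two loss terms can be balanced, the following Python sketch combines a summed cross-entropy over the current task's training nodes with one over the replayed experience nodes. The function names, the per-node probability inputs, the reading of "sample proportion" as the buffer's share of all nodes, and the handling of an empty buffer on the first task are all assumptions made for this example.

```python
import math

def cross_entropy_sum(probs_of_true_class):
    """Summed negative log-likelihood; each entry is the probability
    the model assigns to a node's true label."""
    return -sum(math.log(p) for p in probs_of_true_class)

def replay_loss(train_probs, buffer_probs, beta=None):
    """Weighted sum of the loss on the new task and on replayed nodes.

    If beta is None, a size-based weight is used (assumed form):
    beta = |B| / (|D_tr| + |B|), so the larger set gets the smaller weight.
    Returns (loss, beta).
    """
    n_train, n_buf = len(train_probs), len(buffer_probs)
    if n_buf == 0:
        # Assumption: with nothing to replay, train on the new task only.
        return cross_entropy_sum(train_probs), 1.0
    if beta is None:
        beta = n_buf / (n_train + n_buf)
    loss_train = cross_entropy_sum(train_probs)
    loss_buffer = cross_entropy_sum(buffer_probs)
    return beta * loss_train + (1.0 - beta) * loss_buffer, beta
```

With summed losses, this choice of weight makes a training set of 8 nodes and a buffer of 2 nodes contribute equally when every node receives the same predicted probability.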
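Subsequently, we perform parameter updates and obtain the optimal parameters by minimizing the empirical risk, i.e., the weighted loss over the training set $\mathcal{D}_i^{tr}$ and the experience buffer $B$:

$$\theta = \arg\min_{\theta \in \Theta} \mathcal{L}'_{\mathcal{T}_i}(f_\theta, \mathcal{D}_i^{tr}, B) \tag{3}$$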
After updating the parameters, we need to select certain nodes in $\mathcal{D}_i^{tr}$ as experience nodes and add them to the experience buffer $B$. Here, $\mathcal{E} = \mathrm{Select}(\mathcal{D}_i^{tr}, e)$ means choosing $e$ nodes in each class as experiences of this task, which will be cached in the experience buffer $B$.
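The overall train-then-select cycle described above can be sketched as a small loop. This is only a schematic under stated assumptions: `train_step` and `select` are hypothetical callables standing in for the model update and for any of the experience selection strategies, and nodes are represented by plain identifiers.

```python
def experience_replay(tasks, train_step, select, e):
    """Learn tasks in order, replaying buffered nodes from earlier tasks.

    tasks      : list of training-node lists, one per task
    train_step : callable(train_nodes, buffer_nodes) that updates the model
    select     : callable(train_nodes, e) returning the chosen experience nodes
    e          : number of nodes the selection strategy keeps per class
    """
    buffer = []
    for train_nodes in tasks:
        # Train on the new task together with everything replayed so far.
        train_step(train_nodes, list(buffer))
        # Only after the update, pick experiences from this task and cache them.
        buffer.extend(select(train_nodes, e))
    return buffer
```

Passing a copy of the buffer to `train_step` keeps the update from mutating the stored experiences; the buffer only grows after each task has been learned.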
Since the experience selection strategy is crucial to the performance of our framework, we now turn our attention to the problem of identifying which nodes should be stored in the experience buffer $B$. In the sequel, we present three schemes based on mean of feature, coverage maximization, and influence maximization.
Mean of Feature (MF).
Intuitively, the most representative nodes in each class are those closest to the average feature vector. Similar to [21], for each task, we compute a prototype for each class and choose the $e$ nodes closest to this prototype to form our experiences. In some GNNs, each node $v_j$ has its own attribute vector $a_j$ and an embedding vector $h_j$ before the classification layer (e.g., softmax layer), so we can build our prototypes on either the average attribute vector or the average embedding vector. In these cases, we take the mean of the attribute/embedding vectors to produce prototypes and choose the $e$ nodes whose attribute/embedding vectors are closest to the prototype:

$$c_l = \frac{1}{|S_l|} \sum_{v_j \in S_l} a_j \quad \text{or} \quad c_l = \frac{1}{|S_l|} \sum_{v_j \in S_l} h_j \tag{4}$$

where $S_l$ is the set of training nodes in class $l$, and $c_l$ is the prototype of the nodes in class $l$. Note that, even though we can calculate prototypes from embedding vectors, we save the original nodes as our experiences, since we will feed these nodes to our model again when learning new tasks.
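A minimal sketch of the prototype-based selection is given below. It works on whichever vectors are supplied (attributes or embeddings); the use of Euclidean distance and the dictionary-based inputs are assumptions of this example, since the text only says "closest".

```python
import math

def mean_of_feature(features, labels, e):
    """Pick, for every class, the e nodes closest to the class prototype.

    features : dict node_id -> list of floats (attribute or embedding vector)
    labels   : dict node_id -> class label
    Returns a dict class label -> list of selected node ids.
    """
    by_class = {}
    for node, label in labels.items():
        by_class.setdefault(label, []).append(node)

    selected = {}
    for label, nodes in by_class.items():
        dim = len(features[nodes[0]])
        # Prototype = component-wise mean of the class's feature vectors.
        prototype = [sum(features[n][k] for n in nodes) / len(nodes) for k in range(dim)]
        # Rank nodes by Euclidean distance to the prototype (closest first).
        ranked = sorted(nodes, key=lambda n: math.dist(features[n], prototype))
        selected[label] = ranked[:e]
    return selected
```

The function returns node identifiers rather than vectors, matching the point above that the original nodes, not their embeddings, are what gets stored.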
Coverage Maximization (CM).
When $e$ is small, it might be advantageous to maximize the coverage of the attribute/embedding space. Drawing inspiration from [3], we suggest that approximating a uniform distribution over all nodes from the training set $\mathcal{D}_i^{tr}$ in each task can facilitate choosing experience nodes. In order to maximally cover the space, we rank the nodes in each class according to the number of nodes from other classes in this task within a fixed distance $d$:

$$\mathcal{N}(v_j) = \{\, v_k \mid \mathrm{dist}(v_j, v_k) < d,\ \mathcal{Y}(v_j) \neq \mathcal{Y}(v_k) \,\} \tag{5}$$

where $\mathcal{Y}(v_j)$ is the label of node $v_j$, $\mathcal{N}(v_j)$ is the set of nodes that come from different classes and whose distance to $v_j$ is smaller than $d$, and its cardinality $|\mathcal{N}(v_j)|$ determines the position of $v_j$ in the ranking. We can choose the first $e$ nodes in each class as our experiences based on this order. Similarly to MF, we can maximize the coverage of the attribute space or of the embedding space (if the embedding vectors are readily available in the GNN).
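The ranking step can be sketched as follows. The direction of the ranking is an assumption here: the example prefers nodes with the fewest other-class neighbours within distance `d`. Euclidean distance and the brute-force pairwise comparison are likewise choices made only to keep the illustration short.

```python
import math

def coverage_maximization(features, labels, e, d):
    """Rank nodes in each class by how many other-class nodes lie within distance d.

    Assumption: nodes with the FEWEST such neighbours are selected first.
    Returns (selected, counts): class label -> chosen node ids, and the
    per-node neighbour counts.
    """
    nodes = list(labels)
    counts = {}
    for v in nodes:
        counts[v] = sum(
            1
            for u in nodes
            if labels[u] != labels[v] and math.dist(features[u], features[v]) < d
        )
    selected = {}
    # sorted() is stable, so ties keep the original node order.
    for v in sorted(nodes, key=lambda n: counts[n]):
        chosen = selected.setdefault(labels[v], [])
        if len(chosen) < e:
            chosen.append(v)
    return selected, counts
```

The pairwise loop is quadratic in the number of training nodes, so a spatial index would be needed on larger graphs.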
Influence Maximization (IM).
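When training on each task $\mathcal{T}_i$, we can remove one node $v_\star$ from the training set $\mathcal{D}_i^{tr}$ and obtain a new training set $\mathcal{D}_{i \setminus v_\star}^{tr}$. Writing $\mathcal{L}(f_\theta, v_j)$ for the loss on a single node $v_j$, we can calculate the set of optimal parameters as:

$$\theta_{\setminus v_\star} = \arg\min_{\theta \in \Theta} \sum_{v_j \in \mathcal{D}_{i \setminus v_\star}^{tr}} \mathcal{L}(f_\theta, v_j) \tag{6}$$

resulting in a change in the optimal parameters: $\theta_{\setminus v_\star} - \theta$. However, estimating the influence of every removed training node $v_\star$ is prohibitively expensive, since we need to retrain the model for each removed node. An influence function strategy to efficiently approximate this behavior was used in [14]; the basic idea is to compute the change in the optimal parameters if $v_\star$ were upweighted by some small $\epsilon$, which gives the new parameters:

$$\theta_{\epsilon, v_\star} = \arg\min_{\theta \in \Theta} \frac{1}{|\mathcal{D}_i^{tr}|} \sum_{v_j \in \mathcal{D}_i^{tr}} \mathcal{L}(f_\theta, v_j) + \epsilon\, \mathcal{L}(f_\theta, v_\star) \tag{7}$$

where the influence of upweighting $v_\star$ on the parameters is given by:

$$\mathcal{I}_{\text{up,params}}(v_\star) = \frac{d\theta_{\epsilon, v_\star}}{d\epsilon}\bigg|_{\epsilon = 0} = -H_\theta^{-1} \nabla_\theta \mathcal{L}(f_\theta, v_\star) \tag{8}$$

where $H_\theta = \frac{1}{|\mathcal{D}_i^{tr}|} \sum_{v_j \in \mathcal{D}_i^{tr}} \nabla_\theta^2 \mathcal{L}(f_\theta, v_j)$ is the Hessian matrix, and Eq. (8) shows that removing node $v_\star$ is the same as upweighting it by $\epsilon = -1/|\mathcal{D}_i^{tr}|$. Thus, we can linearly approximate the parameter change of removing $v_\star$ as $\theta_{\setminus v_\star} - \theta \approx -\frac{1}{|\mathcal{D}_i^{tr}|} \mathcal{I}_{\text{up,params}}(v_\star)$ without retraining the model. $\mathcal{I}_{\text{up,params}}(v_\star)$ can be considered as the influence of node $v_\star$ on task $\mathcal{T}_i$, and we hypothesize that the larger the influence $\mathcal{I}_{\text{up,params}}(v_\star)$, the more representative $v_\star$ is for this task. Thus, we choose the first $e$ representative nodes in each class as our experiences, and refer to [14] (where the application of influence functions in computer vision is investigated) for more detailed explanations on this topic. However, to our knowledge, we are the first to incorporate the influence function into continual learning settings to guide the selection of experience nodes.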
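To make the influence computation concrete, the sketch below applies it to a toy least-squares model, where the Hessian and the per-node gradients have closed forms. This is not a GNN and not the implementation used in the experiments: the model, the optional `damping` term, and the choice to score nodes by the Euclidean norm of their influence vector are all assumptions made for illustration.

```python
import numpy as np

def influence_selection(X, y, labels, e, damping=1e-8):
    """Toy influence-based selection for a least-squares model (not a GNN).

    X      : (n, p) feature matrix, y : (n,) regression targets
    labels : (n,) class label per node, used only to group the selection
    Returns (selected, delta), where delta[j] approximates the parameter
    change caused by removing node j, i.e. -(1/n) * I_up,params(v_j).
    """
    n, p = X.shape
    # Averaged squared loss 0.5 * (x^T theta - y)^2: Hessian and optimum.
    hessian = X.T @ X / n + damping * np.eye(p)
    theta = np.linalg.solve(hessian, X.T @ y / n)
    residual = X @ theta - y
    grads = residual[:, None] * X                      # per-node gradients, (n, p)
    influence = -np.linalg.solve(hessian, grads.T).T   # I_up,params per node
    delta = -influence / n                             # removal = upweight by -1/n
    scores = np.linalg.norm(influence, axis=1)         # assumed scoring rule
    selected = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        top = idx[np.argsort(-scores[idx])[:e]]
        selected[c.item()] = top.tolist()
    return selected, delta
```

For this particular model, the approximation can be checked against exact leave-one-out retraining: the two parameter changes differ only by the factor $1/(1 - h_j)$, where $h_j$ is the leverage of node $j$, which is close to 1 when no single node dominates the fit.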
Evidently, our framework ER-GNN does not impose any restriction on the GNN architecture and can be easily combined with any GNN model. In our evaluation, we implement ER-GNN with a vanilla GCN [12], forming an instance of our framework, ER-GCN.
We now present the results from the empirical evaluation of our framework for continual node classification tasks, to demonstrate its effectiveness and applicability. We begin with systematically investigating to what extent the state-of-the-art GNNs forget on learning a sequence of node classification tasks, followed by the performance lift of our ER-GNN. Subsequently, we verify the applicability of our framework and conduct an ablation study to identify each component’s contribution, before exploring the hyperparameter sensitivity of our framework.
To compare the performance of our framework against several GNNs, we conduct comprehensive experiments on three commonly used datasets: Cora [22], Citeseer [22], and Reddit [10]. To meet the requirements of the continual graph learning setting (task-incremental setting), we construct 3 tasks on Cora and Citeseer, each of which is a 2-way node classification task, i.e., there are 2 classes in each task. For Reddit, we generate 8 tasks, each of which is a 5-way node classification task, due to its relatively large number of unique labels. Each task is a new task, since the classes in different tasks are completely different and the data from one task is no longer available once learning for that task is complete. The statistics of the datasets and the continual task setting are shown in Table 1.
| Dataset | Cora | Citeseer | Reddit |
|---|---|---|---|
| # Node Attributes | 1,433 | 3,703 | 602 |
| # Total Classes | 7 | 6 | 41 |
| # Classes in Each Task | 2 | 2 | 5 |
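The class-disjoint task construction can be sketched as below. The text does not say how classes are assigned to tasks or what happens to classes left over when the class count is not a multiple of the task size; grouping classes in sorted order and dropping the remainder are assumptions of this example, and `build_tasks` is a hypothetical helper name.

```python
def build_tasks(node_labels, classes_per_task):
    """Split a labelled node set into class-disjoint tasks.

    node_labels      : dict node_id -> class label
    classes_per_task : number of classes in every task (k in "k-way")
    Classes are grouped in sorted order (an arbitrary choice for this sketch);
    leftover classes that cannot fill a whole task are dropped.
    """
    classes = sorted(set(node_labels.values()))
    n_tasks = len(classes) // classes_per_task
    tasks = []
    for t in range(n_tasks):
        task_classes = set(classes[t * classes_per_task:(t + 1) * classes_per_task])
        nodes = [v for v, c in node_labels.items() if c in task_classes]
        tasks.append({"classes": sorted(task_classes), "nodes": nodes})
    return tasks
```

Under these assumptions, 7 classes with 2 classes per task yield 3 tasks, and 41 classes with 5 classes per task yield 8 tasks.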
To demonstrate the effectiveness of our proposed framework, we compare ER-GNN with the following GNNs for continual node classification tasks:
- Deepwalk [20]: Deepwalk uses local information from truncated random walks as input to learn a representation that encodes structural regularities.
- Node2Vec [9]: Node2Vec learns a mapping of nodes to a low-dimensional feature space that maximizes the likelihood of preserving the network neighborhoods of nodes.
- GraphSAGE [10]: GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood.
- GCN [12]: GCN uses an efficient layer-wise propagation rule based on a first-order approximation of spectral convolution on the graph.
- GIN [27]: GIN develops a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test.
- SGC [24]: SGC reduces the complexity of GCN by successively removing nonlinearities and collapsing the weight matrices between consecutive layers.
We implement ER-GNN with GCN, forming an instance of our framework, ER-GCN, together with our three experience selection strategies. In ER-GCN, we can readily obtain the embedding vector (before the last softmax layer). We thus obtain ER-GCN-MF, ER-GCN-MF*, ER-GCN-CM, ER-GCN-CM*, and ER-GCN-IM, in which -MF, -MF*, -CM, -CM*, and -IM represent mean of attribute, mean of embedding, attribute space coverage maximization, embedding space coverage maximization, and influence maximization, respectively. We use the Adam optimizer to learn each task, with a learning rate of 0.01 and a weight decay of 0.0005. The dimensionality of the embedding space is set to 32. The settings of all baselines are the same as suggested in the respective original papers. To avoid memory overhead, we set the number of experiences stored in the experience buffer from each class to 1 (i.e., $e = 1$).
To measure performance in the continual graph learning setup, we take performance mean (PM) and forgetting mean (FM) as our evaluation metrics. Taking Cora as an example, after learning 3 tasks sequentially, there are 3 accuracy values, i.e., one for each task right after that task is learned, and 2 forgetting values, i.e., the performance difference between the point right after a particular task is learned and the point after the subsequent tasks are learned. Evaluation on Citeseer is the same as on Cora, while Reddit is an exception: there, we use the Micro F1 score as the performance metric, since the number of nodes is imbalanced across classes, so PM and FM are the average Micro F1 score and the average difference in Micro F1 score, respectively.
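A short sketch of the two metrics follows. It assumes that forgetting for a task is measured between the score obtained right after learning that task and the score obtained after all tasks have been learned; the text leaves open whether intermediate points are also used, and the function name is hypothetical.

```python
def performance_and_forgetting(score_after_learning, score_at_end):
    """Compute performance mean (PM) and forgetting mean (FM).

    score_after_learning[t] : score on task t right after task t was learned
    score_at_end[t]         : score on task t after all tasks were learned
    """
    n = len(score_after_learning)
    pm = sum(score_after_learning) / n
    # The last task has no subsequent task, so it contributes no forgetting value.
    drops = [score_after_learning[t] - score_at_end[t] for t in range(n - 1)]
    fm = sum(drops) / len(drops) if drops else 0.0
    return pm, fm
```

For three tasks this averages three performance values and two forgetting values, matching the counts given for Cora.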
– Catastrophic Forgetting in GNNs. We first systematically evaluate the extent of catastrophic forgetting in conventional GNNs in our continual graph learning setting. Table 2 shows the performance comparison with the baselines, from which we can clearly observe catastrophic forgetting in GNNs: FM on Cora, Citeseer, and Reddit is 31+%, 24+%, and 33+%, respectively. Interestingly, in some cases (e.g., Reddit), DeepWalk (or Node2Vec) performs better than the other baselines in terms of FM but worse in terms of PM. A simple explanation is that DeepWalk (or Node2Vec) sacrifices performance when learning new tasks (i.e., plasticity) in exchange for less forgetting (i.e., stability). As for the performance gap between GCN and SGC, where GCN outperforms SGC in our setting, we argue that the nonlinearity and multi-layer architecture of GCN help the model remember and acquire knowledge from previous tasks. GraphSAGE does well in PM but is prone to forgetting, since it generates embeddings by sampling nodes and different tasks could have different sampling processes. GIN is mediocre on Cora and Citeseer but performs well on Reddit because it overfits on relatively small datasets (Cora and Citeseer), as observed in [24].
– Performance of ER-GNN. We now present the comparison of our framework ER-GNN against the aforementioned baselines. Table 2 shows the results of our framework under the different experience selection strategies. We can clearly observe that our framework decreases FM by a significant margin without giving up the ability to learn new tasks. To compare node selection strategies, we also implement our framework with a random node selection scheme, named ER-GCN-Random, to verify the effectiveness of our three strategies. We evaluate this method on randomly selected experience nodes and report the average performance. From Table 2, we observe that our proposed experience selection schemes outperform ER-GCN-Random, which indicates that the nodes selected by our strategies are more representative and carry more information from previous tasks. The IM strategy performs favorably, so it is reasonable to conclude that using the influence function to choose experience nodes helps alleviate forgetting. Another expected finding is that the embedding-space variant ER-GCN-MF* (ER-GCN-CM*) consistently outperforms its attribute-space counterpart ER-GCN-MF (ER-GCN-CM), owing to more discriminative representations in the embedding space than in the attribute space. An important observation is that our ER-GCN performs comparably with vanilla GCN in terms of PM, and in some cases even outperforms the original GCN. This indicates that our framework does not sacrifice plasticity, since we expand the training set with the nodes in the experience buffer, which come from previous tasks. This property is appealing because it resembles human intelligence: humans can not only remember previous tasks but also exploit the knowledge from preceding tasks to facilitate learning new ones. To provide further insight into our framework, we plot the performance evolution of the first task on Reddit in Figure 2; we omit the other cases due to space limitations and similar results.
For clarity, we only plot the curves of SGC and GCN, as these two methods consistently perform best among the baselines; ER-GCN-MF and ER-GCN-CM are omitted for the same reason. From Figure 2, we can clearly see that our framework alleviates forgetting by a large margin compared with GCN.
– Applicability of ER-GNN. To demonstrate the applicability of our framework, we instantiate ER-GNN with several GNNs. Besides ER-GCN, we implement our framework with SGC and GIN, forming two additional instances, ER-SGC and ER-GIN. Similarly, we apply our three experience selection schemes, obtaining ER-SGC-MF, ER-SGC-CM, ER-SGC-IM, ER-GIN-MF, ER-GIN-CM, and ER-GIN-IM, respectively. The results of all ER-GNN instances are shown in Table 3, from which we observe that ER-SGC and ER-GIN behave similarly to ER-GCN. ER-SGC (or ER-GIN) decreases the forgetting of SGC (or GIN) by a significant margin, which demonstrates both the effectiveness and the applicability of our framework.
– Model Ablation and Hyperparameter Sensitivity. From Table 2 and Table 3, on one hand, ER-GCN (or ER-SGC, ER-GIN) improves the performance of continual node classification by a large margin compared with GCN (or SGC, GIN), which verifies the effectiveness of our framework ER-GNN. On the other hand, ER-GCN performs better than ER-SGC (or ER-GIN), just as GCN surpasses SGC (or GIN) in our setting, which indicates the applicability of our framework and the effect of the underlying GNN model. We believe that the performance of ER-GNN can be improved further by incorporating better networks. The number of nodes stored in the buffer from each class, $e$, and the dimensionality of the embedding space are two crucial hyperparameters, so we conducted a study to investigate their effects and report the results on Cora in Figure 3 (we omit the other datasets due to similar results and space limitations). From Figure 3 (a), as expected, the more nodes stored in the buffer, the better the performance. However, too large an $e$ leads to additional memory overhead. Figure 3 (b) illustrates another advantage of our framework: insensitivity to the dimensionality of the embedding space. Another parameter of our framework is the weight factor $\beta$. Figure 4 shows that there exists a trade-off between plasticity and stability. Too large a $\beta$ (e.g., 0.9) sacrifices stability, while too small a $\beta$ (e.g., 0.1) degrades the performance on new tasks (i.e., plasticity). The dynamic $\beta$ in our framework achieves a good trade-off between stability and plasticity.
In this work, we formulated a novel graph-based continual learning problem, where the model is expected to learn a sequence of node classification tasks without catastrophic forgetting. We presented a general framework called ER-GNN that exploits experience replay to mitigate the impact of forgetting, and discussed three experience selection schemes, including a novel one, IM (Influence Maximization), which uses the influence function to select experience nodes. Extensive experiments demonstrated the effectiveness and applicability of ER-GNN. As part of our future work, we plan to extend continual learning to other graph-related tasks, such as graph alignment and cascade/diffusion prediction, under sequences of evolving data.
[1] (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In ECCV.
[2] Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 10 (3), pp. 1–145.
[3] (2016) Improved deep reinforcement learning for robotics through distribution-based experience retention. In IROS.
[4] (2019) Continual learning: a comparative study on how to defy forgetting in classification tasks. arXiv:1909.08383.
[5] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS.
[6] (2018) Large-scale learnable graph convolutional networks. In SIGKDD.
[7] (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211.
[8] (2012) Studies of mind and brain: neural principles of learning, perception, development, cognition, and motor control. Vol. 70, Springer Science & Business Media.
[9] (2016) Node2vec: scalable feature learning for networks. In SIGKDD.
[10] (2017) Inductive representation learning on large graphs. In NIPS.
[11] (2011) Robust statistics: the approach based on influence functions. Vol. 196, John Wiley & Sons.
[12] (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
[13] (2017) Overcoming catastrophic forgetting in neural networks. PNAS 114 (13), pp. 3521–3526.
[14] (2017) Understanding black-box predictions via influence functions. In ICML.
[15] (2017) Learning without forgetting. TPAMI 40 (12), pp. 2935–2947.
[16] (2017) Gradient episodic memory for continual learning. In NIPS.
[17] (2018) PackNet: adding multiple tasks to a single network by iterative pruning. In CVPR.
[18] (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102 (3), pp. 419.
[19] (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
[20] (2014) DeepWalk: online learning of social representations. In SIGKDD.
[21] (2017) iCaRL: incremental classifier and representation learning. In CVPR.
[22] (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93–93.
[23] (2018) Graph attention networks. In ICLR.
[24] (2019) Simplifying graph convolutional networks. In ICML.
[25] (2019) A comprehensive survey on graph neural networks. arXiv:1901.00596.
[26] (2018) Reinforced continual learning. In NeurIPS.
[27] (2018) How powerful are graph neural networks? In ICLR.
[28] (2017) Continual learning through synaptic intelligence. In ICML.
[29] (2019) Heterogeneous graph neural network. In SIGKDD.