Node classification is a central task in network analysis. It is an important building block of numerous real-world applications, such as product recommendation in e-commerce websites, advertisement distribution in social networks, and protein function identification for disease diagnosis. Many research efforts have been made to develop reliable and efficient methods for node classification in networked data.
In the era of big data, massive amounts of raw data in information networks are produced every day. However, labeled data are expensive and slow to acquire because human annotation is costly and time-consuming, making it difficult to train a classifier that generalizes well. Moreover, in some newly formed networks, such as a protein-protein interaction network constructed by some researchers, there may be no labels at all. Hence, it would be impossible to classify the nodes using only the information in this network. To tackle these issues, a promising approach is to utilize class information from other similar or related networks to assist in classification, i.e., transfer learning on networked data [3, 4].
In this paper, we consider a cross-network node classification problem that aims to leverage a partially labeled source attributed network to facilitate node classification in another completely unlabeled or partially labeled target attributed network (Figure 1). The challenges lie in several aspects. First, there may be a significant domain divergence between the source and target networks and they may not have many attributes in common. Second, there are no cross-network edges to propagate knowledge from the source network to the target network. Third, only a small portion of nodes in the source network are labeled.
Existing network embedding methods [5, 6, 7] are insufficient to address these challenges. They first learn compact node representations that preserve network structural information, and then train a classifier on the learned representations for node classification. Most of these methods learn node representations in an unsupervised manner, and are often less effective than graph-based semi-supervised learning methods for node classification. Moreover, topology-only embedding methods cannot be easily generalized to cross-network problems because they lack a similarity-preserving component that pushes nodes of the same category from the two networks close together in the embedding space.
Graph-based semi-supervised learning methods [9, 10] have been shown to be highly effective for node classification in a single network with only a few labeled nodes. The recently proposed graph convolutional networks (GCN) and follow-up works such as GraphSAGE and GAT naturally integrate network topology, node attributes and observed node labels into an end-to-end learning framework, and achieve superior performance on node classification. However, these methods are designed for learning tasks in a single network domain and inherently have difficulty generalizing to another network domain that may have a substantially different attribute set.
Some methods [13, 3] have been proposed to leverage the relationship between multiple networks to improve learning performance. Both EOE and DMNE learn embeddings for multiple networks simultaneously, but they rely heavily on the existence of cross-network connections, making them inapplicable to our problem. Currently there is little exploration of knowledge transfer across different networks to assist in learning tasks such as node classification.
Although there are many existing domain adaptation methods for vector-based data such as images and texts (bag-of-words) [17, 18], they are not applicable to graph-structured data, as entities in a graph are highly correlated with each other, which violates the assumption of independent and identically distributed (IID) data samples in each individual domain. Little research has been conducted on domain adaptation for graph-structured data. To the best of our knowledge, CDNE is the only attempt; it learns transferable node embeddings for cross-network learning tasks by minimizing the maximum mean discrepancy (MMD) loss. However, it cannot jointly model network structures and node attributes, which might limit its modeling capacity. Besides, it relies heavily on preprocessing the adjacency matrix into a positive pointwise mutual information (PPMI) matrix, which makes the sparse adjacency matrix denser and thus increases the computational cost of its autoencoder-based model architecture.
To address the challenges for cross-network node classification, we propose a novel network transfer learning framework AdaGCN that is based on adversarial domain adaptation with graph convolutional networks. The idea is two-fold: to learn class discriminative node representations via graph convolutional networks, and to learn domain invariant node representations via adversarial learning. Hence, AdaGCN consists of a semi-supervised learning component and an adversarial domain adaptation component.
On one hand, the semi-supervised component is dedicated to learning discriminative node representations for classification with the available labeled data from both the source and target networks. GCN enables training a well-behaved classifier even with only a small set of labeled nodes in the source network (as shown in Section 5.2.1). However, the original GCN layer only conducts Laplacian smoothing on the features of nearby nodes within one hop, and it requires stacking many layers to increase the smoothing level, which greatly increases the number of trainable parameters and can result in overfitting. To alleviate this issue, we propose to use an improved GCN layer designed with a smoothing strength hyperparameter, which makes the model more efficient.
On the other hand, the adversarial domain adaptation component is aimed at mitigating the distribution shift between the source and target domains to encourage knowledge transfer by learning domain invariant representations via adversarial learning. Specifically, we model the domain adaptation process as a two-player game similar to GANs, where the representation learner GCN acts as the generator for learning domain invariant node representations, while a domain critic acting as the discriminator is optimized to distinguish node representations from the source and target networks. By combining the two components, AdaGCN can learn both class discriminative and domain invariant node representations for transferring class information across networks.
Extensive experiments on real attributed networks show that AdaGCN can work in both the unsupervised setting (i.e., a completely unlabeled target network) and the semi-supervised setting (i.e., a scarcely labeled target network). Besides, it has low dependence on the common attributes shared by the source and target networks. The main contributions of this paper can be summarized as follows:
We pioneer in studying a challenging network transfer learning problem under a realistic setting, where a partially labeled source network is utilized to assist node classification in a completely unlabeled or partially labeled target network.
We develop a novel and principled framework for network transfer learning by efficiently integrating techniques of adversarial domain adaptation and graph convolution.
We conduct extensive experiments on real-world information networks to verify the effectiveness of our model, which demonstrates its superior performance compared with state-of-the-art baselines, impressive label efficiency, and good model robustness against distribution discrepancy.
The organization of this paper is as follows. We review the literature in Section 2. We formulate the research problem in Section 3. In Section 4, a detailed description of the proposed methods is presented. Then, the experimental results and analysis are provided in Section 5. Finally, a short summary of the contributions and possible directions of future work is included in Section 6.
2 Related Work
2.1 Single Network Learning
Network embedding aims to learn compact node representations, based on network topology only or together with side information, in an unsupervised manner to facilitate a range of learning tasks, such as node classification and network visualization. Among topology-only embedding methods, most existing works focus on preserving network structures and properties in embedding vectors through various techniques such as negative sampling [6, 5, 7], matrix factorization [21, 22], and deep learning models [23, 24, 25, 26]. Most recently, regularization methods based on generative adversarial networks or adversarial training have been exploited to handle noisy and incomplete networked data and improve generalization ability [27, 28, 29]. Aside from topology-only methods, many models have been proposed to incorporate side information such as node attributes [30, 28, 31]. For example, ANRL optimizes both a network structure preserving loss and a feature reconstruction loss based on a stacked autoencoder.
The unsupervised learning methods do not specifically tailor the latent vectors for node classification, which makes them inferior to some customized models. Semi-supervised learning methods, including those using network topology and observed labels and those combining network structures with available labels and node attributes [33, 9, 10, 34, 11, 12, 35], achieve state-of-the-art performance. Planetoid optimizes a supervised loss and a context-preserving loss. GCN is a deep convolutional learning paradigm for graph-structured data that integrates local node attributes and graph topology in convolutional layers. GraphSAGE is a variant of GCN that designs different aggregation methods for feature extraction. GAT improves GCN by leveraging an attention mechanism to aggregate features from the neighbors of a node with different weights.
While these methods can be adapted to cross-network learning, the distribution drift between different network domains severely hampers knowledge transfer, especially for the topology-only methods.
2.2 Multi-Network Learning
A branch of work aims to leverage the relationship between multiple networks to facilitate learning, including those relying on inter-network edges [14, 13], those focusing on identifying common nodes across networks [36, 37], and those managing to transfer knowledge from the source network(s) to the target network(s) [38, 3, 39, 40, 4].
Both EOE and DMNE learn embeddings for multiple networks simultaneously. Specifically, EOE introduces a harmonious embedding matrix to model inter-network node similarities, while DMNE adapts an autoencoder for multi-network embedding with a co-regularized loss to model cross-network relationships. However, these methods rely heavily on the existence of cross-network connections, which makes them inapplicable to our problem. Another line of related research is network alignment [36, 37], which aims to identify the node correspondence across networks with or without cross-network edges. It differs from our problem both in its assumption that common nodes exist across networks and in its research goal of finding them.
There is also some literature focusing on transferring knowledge from the source network(s) to the target network(s) for various tasks, such as social ties inference, positive/negative link prediction, and node classification [3, 42]. In this paper, we aim to utilize knowledge in the source network to assist classification in the target network as in [3, 42, 4]. In one of these works, non-negative matrix factorization is jointly applied on the label propagation matrices of both the source and target networks so as to learn transferable structure features. However, it suffers from expensive computation in the matrix decomposition process, and it cannot jointly model the relationships among structural information, node attributes and node labels, which might cause negative transfer. CDNE is closely related to our work. It first learns node embeddings for multiple networks with different stacked autoencoders and mitigates the distribution shift of node representations between networks by minimizing the MMD loss, and then trains a node classifier with the learned node representations.
2.3 Domain Adaptation
Domain adaptation is a subtopic of transfer learning, which aims to mitigate the harmful effect of domain drift when transferring knowledge from source to target [15, 16]. Approaches for domain adaptation can be classified into three groups: instance-based methods, parameter-based methods, and feature-based methods [45, 46]. Among them, deep feature-based domain adaptation methods have attracted a lot of attention in recent years due to their effectiveness. They can be categorized into three branches, i.e., discrepancy-based methods [47, 45], reconstruction-based methods [48, 49], and adversarial-based methods [17, 50, 18, 51, 52].
In this paper, we are interested in adversarial-based methods. They are motivated by the theory in [53, 54], which suggests that when an algorithm cannot learn to identify the domain of given representations, such representations are good for knowledge transfer. DANN was proposed to learn domain invariant features by formulating the problem as a minimax game similar to GANs, with a feature extractor acting as the generator and a domain classifier acting as the discriminator. Further, WDGRL exploits the Wasserstein distance to improve the loss of DANN, which equips the model with better gradient properties and a more promising generalization bound. Meanwhile, MADA and CDAN leverage discriminative information from the label classifier to facilitate the alignment of multimodal distributions from different domains. In this paper, we leverage these adversarial-based techniques for domain adaptation on graph-structured data. The main difference is that the majority of previous methods are proposed for vector-based data such as images and texts with the assumption of IID samples within each domain, while here we aim to explore domain adaptation for graph-structured data that has complicated correlations among data entities.
3 Problem Definition
In this paper, we study domain adaptation for networked data, i.e., leveraging the information of a source network to assist node classification in a completely unlabeled or partially labeled target network. The source network can be either partially labeled or fully labeled. In this section, we formally define the research problem and introduce notations used throughout the paper as summarized in Table 1.
Denote by $G^s = (V^s, A^s, X^s)$ the source network, where $V^s$ is the node set ($|V^s| = n^s$), $A^s$ is the weighted adjacency matrix with $A^s_{ij}$ quantifying the strength of connection between nodes $v_i$ and $v_j$, and $X^s$ is the feature matrix with $m^s$ as the number of node attributes in $G^s$ and the $i$-th row of $X^s$ as the feature vector associated with node $v_i$. Denote by $V^s_L$ the set of labeled nodes in $G^s$ and $Y^s$ the label matrix of $V^s_L$, where $Y^s_{ic} = 1$ if node $v_i$ is associated with label $c$ and $Y^s_{ic} = 0$ otherwise.
Similarly, the target network is represented as $G^t = (V^t, A^t, X^t)$, where $V^t$ is the node set ($|V^t| = n^t$), $A^t$ is the weighted adjacency matrix, and $X^t$ is the feature matrix with $m^t$ as the number of node attributes in $G^t$. The target network can be either completely unlabeled or partially labeled. Here, we assume that it is completely unlabeled for simplicity, but our method can be straightforwardly extended to the partially labeled setting, and we have conducted experiments for both scenarios in Sections 5.2 and 5.3.
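The source network and the target network may contain different attributes. Denote by $\mathcal{F}^s$ and $\mathcal{F}^t$ the sets of node attributes in $G^s$ and $G^t$, respectively. We construct a new attribute set $\mathcal{F} = \mathcal{F}^s \cup \mathcal{F}^t$ with $m = |\mathcal{F}|$, where $m$ represents the total number of attributes. We then reformulate the feature matrices of both $G^s$ and $G^t$ to make them include all the attributes in $\mathcal{F}$. With a slight abuse of notation, we still use $X^s$ and $X^t$ to represent the newly formed feature matrices of $G^s$ and $G^t$. In particular, a nonzero entry $X^s_{ij}$ represents the value of the $j$-th attribute associated with node $v_i$ in $G^s$, and $X^s_{ij} = 0$ indicates that the attribute is not associated with node $v_i$; the same convention applies to $X^t$.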
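The construction of the unified attribute set can be illustrated with a minimal Python sketch. The function name `unify_features`, the attribute names, and the dense list-of-lists layout are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
def unify_features(attrs_s, X_s, attrs_t, X_t):
    """Re-express both feature matrices over the union of the two attribute sets."""
    seen = set(attrs_s)
    # Source attributes first, then target-only attributes, in a fixed order.
    union = list(attrs_s) + [a for a in attrs_t if a not in seen]
    column = {a: j for j, a in enumerate(union)}

    def expand(attrs, X):
        expanded = []
        for row in X:
            new_row = [0.0] * len(union)  # 0 marks "attribute not associated"
            for a, value in zip(attrs, row):
                new_row[column[a]] = value
            expanded.append(new_row)
        return expanded

    return union, expand(attrs_s, X_s), expand(attrs_t, X_t)


# Hypothetical toy data: two source nodes, one target node.
attrs_s, X_s = ["graph", "query"], [[1.0, 0.0], [0.0, 2.0]]
attrs_t, X_t = ["query", "image"], [[3.0, 1.0]]
union, X_s_new, X_t_new = unify_features(attrs_s, X_s, attrs_t, X_t)
```

After this step both matrices have one column per attribute in the union, so a single set of shared projection weights can be applied to either network.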
Define a network domain as $\mathcal{D} = \{G, f\}$, which includes an attributed network $G$ and a function $f$ for the node classification task. Then, the source network domain and the target network domain can be represented by $\mathcal{D}^s = \{G^s, f^s\}$ and $\mathcal{D}^t = \{G^t, f^t\}$, respectively. The problem considered in this paper is similar to the conventional domain adaptation problem as in [15, 16]. Specifically, there exists a domain divergence between the source and target networks, i.e., $\mathcal{D}^s \neq \mathcal{D}^t$, but the label space is the same, and our goal is to learn a classifier to accurately classify the nodes in the target network with the assistance of the partially labeled source network.
4 Proposed Method
4.1 An Overview of Model Architecture
To solve the cross-network node classification problem, two major challenges need to be addressed. First, how can we fully exploit the available information, including graph structures, node attributes and observed node labels, to learn useful node representations for the two networks? Second, how can we overcome the serious domain divergence between the two networks to facilitate knowledge transfer in the absence of cross-network edges and with only a few common node attributes across networks?
To address the first challenge, we leverage graph convolution to integrate network topology and node attributes in a semi-supervised learning model, which is capable of learning discriminative node representations with available node labels. To tackle the second challenge, we mitigate the distribution discrepancy between the two networks with the technique of adversarial domain adaptation. In particular, we propose a network transfer learning framework, AdaGCN, by combining the techniques of adversarial domain adaptation and graph convolution. The model architecture is shown in Figure 2. It consists of two components: a semi-supervised learning component and an adversarial domain adaptation component. Working together, they allow AdaGCN to learn both class discriminative and domain invariant node representations, thus enabling the classification of nodes in the target network with only a few labeled nodes in the source network. Note that our model is also applicable to the semi-supervised scenario where the target network is partially labeled.
4.2 Network Representation Learning
We propose to use graph convolution to jointly model network structures and node attributes for learning network representations, which has recently been shown to be highly effective in various learning tasks such as node classification, graph clustering and social recommendation.
The graph filter is a matrix designed by manipulating the spectrum of the underlying graph. The graph signal is a real-valued function defined on the nodes of the graph, i.e., each node is associated with a real number. For example, a column of the node feature matrix can be considered as a graph signal.
Graph convolution provides a principled way to combine graph structures and node features for learning useful node representations. In graph convolutional networks (GCN), the graph filter is a renormalized adjacency matrix, which performs Laplacian smoothing: it updates the features of each node with a weighted average of its own features and those of its neighbors to obtain smooth embeddings. Further, it has been shown that to produce smooth embeddings for nodes in the same cluster, the graph filter needs to be low-pass. With a proper low-pass graph filter, graph convolution will generate useful representations that help to ease knowledge transfer across networks and node classification in the target network.
In this paper, we propose two methods for learning network representations with graph convolution. The first one is based on the layer-wise propagation rule of GCN. Specifically, the hidden representations of the $l$-th convolutional layer in the feature extractor are learned by:
$$H^{(l)} = \sigma\left(\hat{A} H^{(l-1)} W^{(l)}\right), \tag{2}$$
where $\hat{A}$ is a renormalized adjacency matrix with a self-loop at each node, $H^{(l-1)}$ is the output of the previous layer ($H^{(0)} = X$), $W^{(l)}$ is a projection matrix with trainable parameters, and $\sigma(\cdot)$ is the activation function. As illustrated in Figure 2, we use two GCNs for learning node representations for the source and target networks respectively, but they share a common set of trainable parameters so as to help transfer knowledge across networks. For simplicity of notation, we denote a GCN as $f_G(A, X; \theta_G)$, which takes the graph adjacency matrix $A$ and the feature matrix $X$ as input, and $\theta_G$ represents the trainable parameters. Then, we can obtain the output node representations of the source and target networks as:
$$Z^s = f_G(A^s, X^s; \theta_G), \quad Z^t = f_G(A^t, X^t; \theta_G),$$
where $z_i$, the $i$-th row of $Z^s$ or $Z^t$, is the representation of node $v_i$.
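As an illustration of the propagation rule and the weight sharing described above, the following is a minimal pure-Python sketch of a single GCN layer applied to two small graphs. It assumes the standard renormalization (add self-loops, then normalize symmetrically by degree), dense matrices, and a ReLU activation; the helper names and the toy graphs are hypothetical, and a real implementation would use sparse tensors in a deep learning framework.

```python
import math


def renormalize(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense, non-negative adjacency matrix A."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # >= 1 thanks to the self-loops
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(P, Q):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def gcn_layer(A_hat, H, W):
    """One layer: ReLU(A_hat @ H @ W)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(A_hat, H), W)]


# Toy source graph (3-node path) and target graph (single edge).
A_s = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A_t = [[0, 1], [1, 0]]
X_t = [[0.5, 0.5], [1.0, 0.0]]

W = [[1.0, -1.0], [0.5, 0.5]]  # one projection matrix shared by both networks
A_hat_s = renormalize(A_s)
Z_s = gcn_layer(A_hat_s, X_s, W)
Z_t = gcn_layer(renormalize(A_t), X_t, W)
```

Because `W` is the only trainable object and it is reused for both graphs, the two networks may differ in size while still being mapped into the same representation space.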
However, with the GCN layer defined as in Eq. (2), one has to stack multiple layers to increase the strength of feature smoothing, which also increases model complexity because of the additional trainable parameters in each layer, and thus can easily cause overfitting, especially for learning tasks with low source training rates. To address this issue, we propose to use an improved GCN (IGCN) layer to increase the strength of the graph convolutional filter for learning better representations. Then, for our second method, the hidden representations of the $l$-th convolutional layer are obtained with:
$$H^{(l)} = \sigma\left(\hat{A}^{k} H^{(l-1)} W^{(l)}\right), \tag{4}$$
where $k$ is the exponent of the renormalized adjacency matrix $\hat{A}$, i.e., the smoothing parameter. By setting an appropriate $k$, we can easily control the smoothing strength of graph convolution to facilitate knowledge transfer and classification while avoiding overfitting. Normally, a larger $k$ should be used with lower source training rates.
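The effect of the smoothing parameter can be seen in a small pure-Python sketch. It assumes the filter is applied by left-multiplying the features by the normalized adjacency matrix repeatedly, so that no extra trainable parameters are introduced; the function names and the toy 2-node filter below are illustrative stand-ins, not the paper's code.

```python
def matmul(P, Q):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def igcn_layer(A_hat, H, W, k):
    """ReLU(A_hat^k @ H @ W), computed as k successive multiplications by A_hat."""
    smoothed = H
    for _ in range(k):
        smoothed = matmul(A_hat, smoothed)  # one more step of smoothing
    return [[max(0.0, v) for v in row] for row in matmul(smoothed, W)]


# Toy symmetric low-pass filter standing in for the renormalized adjacency matrix.
A_hat = [[0.75, 0.25], [0.25, 0.75]]
H = [[1.0], [0.0]]
W = [[1.0]]  # the parameter count does not depend on k

out_k1 = igcn_layer(A_hat, H, W, 1)
out_k2 = igcn_layer(A_hat, H, W, 2)
```

With `k = 1` the two node features are 0.75 and 0.25; with `k = 2` they move closer, to 0.625 and 0.375, which is the stronger smoothing the text refers to.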
4.3 Semi-Supervised Learning
In AdaGCN, the node representations of the source and target networks learned by GCNs are fed to a classifier for label prediction, and together they form the semi-supervised learning component. The classifier could be a single-layer logistic regression classifier or a multi-layer perceptron. We denote the classifier as $f_C(Z; \theta_C)$, where $Z$ represents the node representations as input and $\theta_C$ represents the trainable parameters. We then denote the prediction scores of nodes in the source and target networks as:
$$\hat{Y}^s = f_C(Z^s; \theta_C), \quad \hat{Y}^t = f_C(Z^t; \theta_C),$$
where $Z^s$ and $Z^t$ are the node representations generated by GCNs and $\hat{Y}_{ic}$ represents the prediction score for node $v_i$ in class $c$. One can conduct multi-class or multi-label classification by changing the activation function of the output layer in the classifier. For multi-class classification, the activation function can be the softmax function. For multi-label classification, the activation function is the sigmoid function. We use the cross-entropy error over all the labeled nodes in the source network as the classification loss:
$$\mathcal{L}_C(\hat{Y}^s_L, Y^s) = -\frac{1}{|V^s_L|} \sum_{v_i \in V^s_L} \sum_{c} Y^s_{ic} \ln \hat{Y}^s_{ic},$$
where $\hat{Y}^s_L$ is the prediction score matrix of the labeled nodes in the source network. Note that our method can be easily extended to the semi-supervised setting by incorporating available target labels into the above cross-entropy loss.
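The restriction of the loss to labeled source nodes can be sketched as follows. This is an illustrative pure-Python version for the multi-class (softmax) case, averaged over the labeled nodes; the averaging, the function names and the toy scores are assumptions. For the multi-label case, one would presumably replace the softmax with a per-label sigmoid and a binary cross-entropy term.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def source_classification_loss(scores, labels, labeled_idx):
    """Average cross-entropy over the labeled nodes only."""
    loss = 0.0
    for i in labeled_idx:
        probs = softmax(scores[i])
        # Only classes with Y_ic = 1 contribute to the sum.
        loss -= sum(y * math.log(p) for y, p in zip(labels[i], probs) if y)
    return loss / len(labeled_idx)


scores = [[0.0, 0.0], [5.0, -5.0], [0.0, 0.0]]  # raw classifier outputs
labels = [[1, 0], [0, 0], [0, 1]]               # node 1 is unlabeled (placeholder row)
labeled_idx = [0, 2]
loss = source_classification_loss(scores, labels, labeled_idx)
```

Unlabeled nodes never enter the sum, so their scores can change freely without affecting this term; they influence training only through graph convolution and the domain adaptation component.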
4.4 Adversarial Domain Adaptation
Domain adaptation theory [53, 54] suggests that when an algorithm cannot learn to identify the domain of given hidden representations, they are good for knowledge transfer across domains. In AdaGCN, we leverage the adversarial domain adaptation method [17, 18] to achieve this goal. Specifically, we model the domain adaptation process as a two-player game similar to GANs, where the representation learning network acts as the generator for learning network invariant node representations, while a domain critic acting as the discriminator is optimized to distinguish node representations from the source and target networks. After the adversarial learning, network invariant representations can be obtained, and class information can be transferred from the source network to the target network.
In the original GANs, the discriminator is a binary classifier, and the generator and the discriminator compete against each other over a log-likelihood objective. However, directly formulating the problem as binary classification and optimizing a cross-entropy loss may suffer from training instability such as mode collapse [60, 61]. To improve learning stability, we instead minimize the Wasserstein-1 distance between the source and target distributions of node representations as suggested in [60, 61, 18].
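We set the domain critic as a fully connected neural network that takes a node representation as input and returns a real number. Denote by $f_D(z; \theta_D)$ the domain critic, where $z$ is the representation of a node generated by a GCN with $X$ as the input node feature matrix, and $\theta_D$ represents the trainable parameters. The first Wasserstein distance between the source and target distributions of node representations, $\mathbb{P}_s$ and $\mathbb{P}_t$, can be computed using the Kantorovich-Rubinstein duality:
$$W_1(\mathbb{P}_s, \mathbb{P}_t) = \sup_{\|f_D\|_L \le 1} \mathbb{E}_{z \sim \mathbb{P}_s}\left[f_D(z)\right] - \mathbb{E}_{z \sim \mathbb{P}_t}\left[f_D(z)\right],$$
where $\|f_D\|_L \le 1$ is the Lipschitz continuity constraint. The distance can be interpreted as the minimum cost of transporting mass for transforming one distribution into another, with the cost defined as the mass times the transport distance. We can further approximate the empirical Wasserstein distance under the 1-Lipschitz assumption by maximizing the following domain critic loss with respect to $\theta_D$:
$$\mathcal{L}_D = \frac{1}{n^s} \sum_{i=1}^{n^s} f_D(z^s_i) - \frac{1}{n^t} \sum_{j=1}^{n^t} f_D(z^t_j),$$
where $z^s_i$ and $z^t_j$ are the representations of the $n^s$ source nodes and the $n^t$ target nodes.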
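The empirical critic loss is simply a difference of two averages, as the following pure-Python sketch shows. For brevity it assumes a linear critic $f_D(z) = w \cdot z + b$, whereas the text describes a fully connected network; the function names and toy representations are hypothetical.

```python
def critic(z, w, b):
    """Linear stand-in for the domain critic: maps a representation to a real number."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def domain_critic_loss(Z_s, Z_t, w, b):
    """Mean critic score on source representations minus mean score on target ones."""
    mean_s = sum(critic(z, w, b) for z in Z_s) / len(Z_s)
    mean_t = sum(critic(z, w, b) for z in Z_t) / len(Z_t)
    return mean_s - mean_t


Z_s = [[1.0, 0.0], [3.0, 0.0]]  # toy source node representations
Z_t = [[0.0, 0.0], [0.0, 2.0]]  # toy target node representations
w, b = [1.0, 1.0], 0.5

loss_d = domain_critic_loss(Z_s, Z_t, w, b)
loss_same = domain_critic_loss(Z_s, Z_s, w, b)
```

The critic is trained to make this quantity large, while the generator is trained to make it small; when the two sets of representations coincide, the loss is zero for any critic.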
To enforce the Lipschitz constraint, we add a gradient penalty for the parameters of the domain critic $f_D$, as suggested in prior work on Wasserstein GANs:
$$\mathcal{L}_{grad}(\hat{Z}) = \left( \left\| \nabla_{\hat{Z}} f_D(\hat{Z}) \right\|_2 - 1 \right)^2,$$
where the representations $\hat{Z}$ can be the source representations, the target representations, and random points along the straight lines between source and target representation pairs. The penalty helps avoid the capacity underuse and gradient vanishing/exploding problems of weight clipping methods for 1-Lipschitz enforcement.
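A minimal sketch of the gradient penalty is given below. It assumes a critic with one tanh hidden layer, $f_D(z) = v \cdot \tanh(Wz + b)$, so that the gradient with respect to the input can be written analytically without an autodiff library; the activation, the averaging over points, and the pairing of source and target nodes by position are all illustrative assumptions.

```python
import math
import random


def critic_input_gradient(z, W, b, v):
    """Gradient w.r.t. z of f_D(z) = v . tanh(W z + b)."""
    pre = [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b_i for row, b_i in zip(W, b)]
    scale = [v_i * (1.0 - math.tanh(p) ** 2) for v_i, p in zip(v, pre)]
    return [sum(scale[i] * W[i][j] for i in range(len(W))) for j in range(len(z))]


def gradient_penalty(Z_s, Z_t, W, b, v, rng):
    """Mean of (||grad f_D(z)||_2 - 1)^2 over source, target and interpolated points."""
    points = list(Z_s) + list(Z_t)
    for z_s, z_t in zip(Z_s, Z_t):  # pairs formed by position (an assumption)
        eps = rng.random()
        points.append([eps * a + (1.0 - eps) * c for a, c in zip(z_s, z_t)])
    total = 0.0
    for z in points:
        grad = critic_input_gradient(z, W, b, v)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return total / len(points)


Z_s, Z_t = [[0.0, 1.0]], [[0.0, -2.0]]
W, b = [[1.0, 0.0]], [0.0]
gp_unit = gradient_penalty(Z_s, Z_t, W, b, [1.0], random.Random(0))
gp_steep = gradient_penalty(Z_s, Z_t, W, b, [2.0], random.Random(0))
```

In this toy case the critic has gradient norm exactly 1 at every evaluated point when `v = [1.0]`, so the penalty vanishes; doubling `v` doubles the gradient norm and yields a penalty of 1.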
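Hence, we solve the following minimax problem for learning network invariant node representations:
$$\min_{\theta_G} \max_{\theta_D} \left\{ \mathcal{L}_D - \gamma \mathcal{L}_{grad} \right\},$$
where $\mathcal{L}_D$ is the domain critic loss, $\mathcal{L}_{grad}$ is the gradient penalty, $\theta_G$ and $\theta_D$ are the parameters of the generator and the domain critic, and $\gamma$ is the gradient penalty coefficient, which should be set to 0 when optimizing the generator. The optimization problem suggests that the domain critic should first be trained to optimality, and then the parameters of the generator are updated to minimize the Wasserstein distance between the source and target node representations.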
4.5 Overall Loss and Model Training
The overall loss of the proposed model AdaGCN is as follows:
$$\min_{\theta_G, \theta_C} \left\{ \mathcal{L}_C + \lambda \max_{\theta_D} \left[ \mathcal{L}_D - \gamma \mathcal{L}_{grad} \right] \right\},$$
where $\lambda$ is the coefficient for balancing semi-supervised learning and domain adaptation. We summarize the training procedure for AdaGCN in Algorithm 1. Note that here we perform full-batch training with gradient descent, but some existing methods can be applied to train the model in a mini-batch manner [63, 64]. First, as presented in lines 4-10, we optimize the parameters of the domain critic via gradient descent with the other model parameters fixed. Then, as shown in lines 12 and 13, we fix the critic parameters $\theta_D$, and update the parameters $\theta_G$ of the generator and $\theta_C$ of the classifier by minimizing the classification loss and the domain adaptation loss simultaneously. When the model converges, we can obtain class discriminative and domain invariant node representations. To classify nodes in the target network, one can simply feed the learned node representations to the trained classifier.
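The alternating schedule described above (several critic updates, then one joint update of the generator and classifier) can be sketched structurally as follows. This is not Algorithm 1 itself: the two callbacks are hypothetical placeholders for the actual gradient steps, and only the ordering of updates is illustrated.

```python
def train(n_epochs, critic_steps, critic_step, model_step):
    """Alternate between critic updates and generator/classifier updates."""
    for _ in range(n_epochs):
        # Phase 1: train the domain critic with generator and classifier fixed.
        for _ in range(critic_steps):
            critic_step()
        # Phase 2: fix the critic, update generator and classifier once.
        model_step()


# Record the order of updates instead of performing real gradient steps.
log = []
train(
    n_epochs=2,
    critic_steps=3,
    critic_step=lambda: log.append("critic"),
    model_step=lambda: log.append("model"),
)
```

In a real implementation, `critic_step` would ascend the critic loss minus the gradient penalty, and `model_step` would descend the classification loss plus the weighted critic loss with the penalty coefficient set to 0.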
The computational complexity of the model mainly consists of three parts: the GCN layers (Eq. (2)), the label classifier (Eq. (4)) and the domain critic (Eq. (7)). Computing hidden representations with a single GCN layer for both the source and target networks through Eq. (2) takes time linear in the number of edges, $|E^s| + |E^t|$, where $E^s$ and $E^t$ are the edge sets of the source and target networks. Note that the IGCN layer preserves linearity, adding only a constant scale factor to the complexity, by left-multiplying by $\hat{A}$ repeatedly $k$ times in Eq. (4). The time complexity of the label classifier and the domain critic is linear in the number of nodes. Thus, the overall complexity of the proposed methods is linear in the size of the networks.
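The linear dependence on the number of edges can be checked with a small operation-counting sketch. It assumes the normalized adjacency matrix is stored as a list of nonzero entries `(i, j, weight)` and counts one multiply-add per nonzero per feature column; the function names and toy data are hypothetical.

```python
def sparse_times_dense(n, nonzeros, H):
    """Multiply a sparse n x n matrix (list of (i, j, w)) by dense H; count multiply-adds."""
    d = len(H[0])
    out = [[0.0] * d for _ in range(n)]
    ops = 0
    for i, j, w in nonzeros:
        for c in range(d):
            out[i][c] += w * H[j][c]
            ops += 1
    return out, ops


def propagate(n, nonzeros, H, k):
    """Apply the sparse filter k times, as in the IGCN layer, and total the cost."""
    total_ops = 0
    for _ in range(k):
        H, ops = sparse_times_dense(n, nonzeros, H)
        total_ops += ops
    return H, total_ops


nonzeros = [(0, 0, 0.5), (0, 1, 0.5), (1, 0, 0.5), (1, 1, 0.5)]
H = [[2.0, 4.0], [0.0, 0.0]]
out1, ops1 = sparse_times_dense(2, nonzeros, H)
_, ops3 = propagate(2, nonzeros, H, 3)
```

One propagation costs (number of nonzeros) × (feature dimension) operations, and applying the filter `k` times multiplies that cost by the constant `k` without changing its linear dependence on the edge count.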
5 Experiments
In this section, we aim to answer the following research questions (RQ) via experiments:
RQ1: How do the proposed methods perform compared with state-of-the-art methods?
RQ2: How do the training rates of the source and target networks, i.e., the ratios of labeled nodes in the source and target networks, affect the transfer learning performance?
RQ3: How does the distribution discrepancy between source and target networks affect the transfer learning results?
RQ4: How does the strength of graph convolution affect the domain adaptation performance?
RQ5: How do the hyper-parameters affect the performance of the proposed methods?
We also visualize the node embeddings learned by the representation learner to provide an intuitive understanding of our proposed methods.
5.1 Experiment Setup
DBLPv7, Citationv1 and ACMv9 are three paper citation networks from different original sources, i.e., DBLP, Microsoft Academic Graph and ACM respectively, and contain papers published in different periods, i.e., between years 2004 and 2008, before year 2008, and after year 2010, respectively. Here we consider them as undirected networks with each edge representing a citation relation between two papers. Each paper belongs to one or more of the following five categories according to its research topics: “Databases”, “Artificial Intelligence”, “Computer Vision”, “Information Security”, and “Networking”. Besides, the keywords extracted from the title of each paper are used as its attributes in the form of a bag-of-words vector. We evaluate our proposed methods by conducting multi-label classification on these three network domains through six transfer learning tasks, C→D, A→D, D→C, A→C, D→A, and C→A, where D, C, A denote DBLPv7, Citationv1 and ACMv9, respectively.
We select baselines from several related research lines including single network embedding methods, graph-based semi-supervised learning methods, deep domain adaptation methods, and transfer learning methods for networked data. The descriptions of them are listed as follows:
DeepWalk, node2vec, ANRL: They are single network embedding methods. Both DeepWalk and node2vec first transform network topology into node sequences, and then use the skip-gram model to learn node representations. ANRL is a deep attributed network embedding model adapted from the autoencoder, and we use its best variant, ANRL-WAN.
GCN, GraphSAGE: They can be used for semi-supervised learning and representation learning. GCN is a deep convolutional network for graph-structured data, which integrates network topology, node attributes and observed labels into an end-to-end learning framework. GraphSAGE is a variant of GCN with different aggregation methods.
DNNs, WDGRL: These two deep models only utilize node attributes. DNNs is a multi-layer perceptron. WDGRL is a state-of-the-art adversarial domain adaptation method with the assumption of IID vector-based inputs in each domain.
5.1.3 Implementation Details
We implement our proposed methods using TensorFlow with the Adam optimizer. For all transfer learning tasks, we use the same set of parameter configurations unless otherwise specified. We first describe the settings of AdaGCN. The GCNs of both the source and target networks contain three hidden layers with structure 1000-100-16. The dropout rate for each GCN layer is set to 0.3. The classifier is a logistic regression model with a sigmoid output layer for multi-label classification. The domain critic contains only one hidden layer with 16 units. A -norm regularization term is imposed on model parameters except those of with the regularization coefficient as . The domain adaptation coefficient, the gradient penalty coefficient, and the number of domain critic training steps are set to 1, 10 and 10, respectively. The learning rates for both components of our method are set to . We train the model for 1000 epochs, and decay the learning rate by a factor of 0.8 every 100 epochs after the first 500 epochs to stabilize training. AdaIGCN has configurations similar to those of AdaGCN, with the only difference in the representation learner, which consists of only one IGCN layer and two additional fully connected layers. The smoothing parameter of the IGCN layer is set to 10 for all tasks. GCN and AdaGCN have the same settings for common hyper-parameters and model structure.
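The learning rate decay described above can be written as a small step schedule. The sketch below assumes the rate stays constant for the first 500 epochs and that the first decay takes effect at epoch 600; whether the first decay happens at epoch 500 or 600 is not specified in the text. The base rate of 1.0 is a placeholder, since the actual value is not available here.

```python
def learning_rate(epoch, base_lr, warm_epochs=500, every=100, factor=0.8):
    """Step decay: constant for warm_epochs, then multiplied by factor every `every` epochs."""
    if epoch < warm_epochs:
        return base_lr
    return base_lr * factor ** ((epoch - warm_epochs) // every)


base_lr = 1.0  # placeholder value, not the paper's setting
schedule = [learning_rate(e, base_lr) for e in range(1000)]
```

Under these assumptions the rate is multiplied by 0.8 four times over a 1000-epoch run, ending at about 41% of its initial value.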
For single network embedding methods, including DeepWalk, node2vec and ANRL, node representations are first learned and then a one-vs-rest logistic regression classifier is trained with labeled nodes of both networks. For fair comparison, the dimension of node representations for these methods is set to 128. For GraphSAGE, we adapt it to the transductive setting for better utilization of linkage information of the two networks, and use its best variant GraphSAGE-LSTM for comparison. Since these methods are designed for a single network, we simply combine the two networks into one and then conduct experiments as single network learning. DNNs has parameter settings similar to GCN, and WDGRL has parameter settings similar to AdaGCN. We have also tried to improve the input features of DNNs and WDGRL by augmenting the feature matrix of the graph with the embedding vectors learned by DeepWalk, but found that this deteriorates performance, which is explainable since the DeepWalk embeddings of the source and target networks are not comparable. Experiments for NetTr and CDNE are conducted as suggested by the corresponding papers.
D: DBLPv7, C: Citationv1, A: ACMv9. The top 2 classification f1-scores are highlighted in bold for each task.
5.2 Performance Comparison (RQ1)
The training rate of a network is defined as , where represents the set of labeled nodes in the network. Different settings of the training rate are constructed by randomly sampling nodes while ensuring that the sampled nodes cover all labels. In this section, we conduct multi-label classification on three datasets with six transfer learning tasks. We consider two settings: an unsupervised setting where only the source network is partially labeled, and a semi-supervised setting where both the source and target networks are partially labeled.
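One way to construct such a labeled set is rejection sampling, sketched below. This is an illustration under stated assumptions rather than the procedure actually used: it assumes the training rate is the fraction of nodes that are labeled, and the function name `sample_labeled_nodes`, the `seed` argument and the `max_tries` limit are all hypothetical.

```python
import random


def sample_labeled_nodes(node_labels, rate, seed=0, max_tries=1000):
    """Randomly pick a labeled subset that covers every label.

    node_labels: dict mapping node id -> set of labels (multi-label).
    rate: assumed to be the fraction of nodes to treat as labeled.
    """
    rng = random.Random(seed)
    nodes = sorted(node_labels)
    all_labels = set().union(*node_labels.values())
    size = max(1, round(rate * len(nodes)))
    for _ in range(max_tries):
        chosen = rng.sample(nodes, size)
        covered = set().union(*(node_labels[v] for v in chosen))
        if covered == all_labels:
            # Accept only samples in which every label appears at least once.
            return set(chosen)
    raise ValueError("no sample covering all labels was found")


# Toy example: ten nodes, two labels, 30% training rate.
toy_labels = {v: ({"a"} if v % 2 == 0 else {"b"}) for v in range(10)}
labeled = sample_labeled_nodes(toy_labels, rate=0.3)
```

Rejection sampling is simple but can be slow when some labels are rare; a stratified draw that first picks one node per label would be an alternative.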
5.2.1 Unsupervised Setting: Partially Labeled Source Network and Completely Unlabeled Target Network
In the unsupervised setting, we conduct experiments with the source training rate as 10% while the target network is completely unlabeled. The experimental results are shown in Table III. It can be easily observed that our proposed method AdaGCN outperforms all the baselines in five out of six tasks, and achieves results comparable to the best baseline CDNE on the sixth task, Citationv1→ACMv9. This demonstrates the effectiveness of our proposed AdaGCN model for cross-network node classification. Specifically, there is a 4.41% relative performance improvement in Micro-F1 score and a 5.81% improvement in Macro-F1 score over the best baseline CDNE on average across all transfer tasks. AdaIGCN further improves on AdaGCN, and outperforms all the baselines consistently in all learning tasks.
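For reference, the two metrics reported here can be computed for multi-label predictions as in the sketch below. It uses the standard definitions of Micro-F1 (counts pooled over all labels) and Macro-F1 (unweighted mean of per-label F1); the function name `micro_macro_f1` and the toy inputs are illustrative only and are not taken from the experiments.

```python
def micro_macro_f1(y_true, y_pred):
    """Micro-F1 and Macro-F1 for multi-label 0/1 indicator matrices.

    y_true, y_pred: lists of rows (one per node), each a list of 0/1 values
    with one entry per label.
    """
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        # Convention: F1 is 0 when a label has no positives at all.
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / n_labels
    return micro, macro


# Toy example with three nodes and two labels.
micro, macro = micro_macro_f1(
    y_true=[[1, 0], [1, 0], [0, 1]],
    y_pred=[[1, 0], [0, 0], [0, 1]],
)
```

Macro-F1 weights every label equally, so it is more sensitive than Micro-F1 to performance on rare labels.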
GCN and GraphSAGE have comparable performance. The proposed AdaGCN method adapts GCN for cross-network learning by combining it with domain adaptation technique. It achieves significant 13.54% and 19.03% relative gains in Micro-F1 and Macro-F1 scores respectively over GCN, which suggests that the adversarial domain adaptation component can effectively mitigate the distribution divergence of two domains and enables a successful knowledge transfer. The proposed AdaIGCN model further achieves significant 4.83% and 4.28% relative improvements in Micro-F1 and Macro-F1 scores respectively over AdaGCN on average, which shows that IGCN can learn better node representations to facilitate knowledge transfer.
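The relative gains quoted in this section follow the usual definition of relative improvement, which can be sketched as below. The helper names and the example scores are hypothetical, and averaging the per-task gains is an assumption: the text does not state whether the average is taken over per-task gains or over the scores themselves.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


def mean_relative_gain(ours_scores, baseline_scores):
    """Average of per-task relative gains (assumed aggregation)."""
    gains = [relative_gain(o, b) for o, b in zip(ours_scores, baseline_scores)]
    return sum(gains) / len(gains)


# Hypothetical scores for two tasks, for illustration only.
example = mean_relative_gain([0.60, 0.55], [0.50, 0.50])
```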
We noticed that both DeepWalk and node2vec have poor performance in all transfer learning tasks as shown in Table III. The reason is that node representations used for multi-label classification are trained independently for the source and target networks since no connections between them exist. This makes the learned representations incomparable across networks, and thus the classifier learned from source labeled data cannot generalize to the target domain. Similar observations have also been made in prior work. Therefore, single network embedding methods with only network topology as input are not directly suitable for multi-network learning. ANRL, as an attributed network embedding method, has much better performance compared with DeepWalk and node2vec, which benefits from the shared node attributes between the source and target networks. However, it is inferior to GCN by a large margin, not to mention the proposed AdaGCN method. The reasons lie in two aspects: firstly, ANRL is an unsupervised embedding method, so node classification can only be conducted after node representations have been learned, while GCN can perform semi-supervised learning in an end-to-end manner; secondly, ANRL suffers from the distribution shift between the source and target domains, while AdaGCN addresses this issue by introducing an adversarial domain adaptation component.
Both DNNs and WDGRL cannot leverage network topology information. It can be observed that the performances of DNNs and WDGRL are poor, although more available labeled nodes can help improve their performances. Besides, we noticed that WDGRL performs worse than DNNs in some tasks, which means that the domain adaptation component of WDGRL results in negative transfer. The reason might be that the distribution divergence between the node attributes of the two domains is too large for the adversarial domain adaptation method to work. Overall, it suggests that existing domain adaptation methods cannot handle the cross-network node classification problem due to their inability to leverage network structure information. In contrast, our proposed AdaGCN method jointly models network structures and node attributes with graph convolution. The Laplacian smoothing on node features with graph convolution in the representation learner enables an easy knowledge transfer across networks.
NetTr and CDNE are two transfer learning methods for cross-network node classification. Our methods outperform NetTr by a large margin. Specifically, AdaGCN achieves remarkable 57.28% and 76.46% relative improvements over NetTr in Micro-F1 and Macro-F1 scores, respectively. One important reason is that NetTr learns transferable representations based on network topology only. Our proposed methods also produce a significant improvement over CDNE on average as mentioned before. Particularly, the relative performance gain of AdaGCN over CDNE reaches 12.47% and 10.14% in Micro-F1 and Macro-F1 scores respectively on ACMv9→DBLPv7. The advantages of our methods over CDNE can be summarized into two aspects: firstly, graph convolution enables a natural combination of node attributes and network structures for representation learning, while CDNE only leverages network structures to extract features; secondly, the adversarial domain adaptation method is shown to be more effective compared with MMD in the literature.
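To make the smoothing effect of graph convolution concrete, the sketch below applies one propagation step with the renormalized adjacency matrix from the GCN formulation of Kipf and Welling, i.e. the adjacency matrix with self-loops, symmetrically normalized by node degree. The learnable weights and nonlinearity of a full GCN layer are omitted, and the function names and the toy path graph are illustrative assumptions.

```python
import math


def normalized_adjacency(adj):
    """Symmetrically normalized adjacency matrix with self-loops added."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def smooth(adj, features):
    """One smoothing step: multiply node features by the normalized adjacency."""
    s = normalized_adjacency(adj)
    n, d = len(adj), len(features[0])
    return [[sum(s[i][k] * features[k][c] for k in range(n)) for c in range(d)]
            for i in range(n)]


# Toy path graph 0 - 1 - 2 with a one-dimensional feature on node 0 only.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
smoothed = smooth(path, [[1.0], [0.0], [0.0]])
```

After one step, node 1 picks up part of node 0's feature, so connected nodes move closer together in feature space while node 2, two hops away, is still unaffected.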
5.2.2 Semi-Supervised Setting: Partially Labeled Source and Target Networks
In the semi-supervised setting, both the source and target networks are partially labeled with the training rates as 10% and 5%, respectively. The experimental results are shown in Table IV.
Due to the additional available labeled data in the target network, all models achieve better classification performance compared with the unsupervised setting as shown in Tables III and IV. There are many similar findings in both the unsupervised and semi-supervised scenarios, and we only highlight some new insights. Firstly, both DeepWalk and node2vec perform significantly better even though only 5% additional labeled nodes in the target network are available. It shows the effectiveness of the learned node embeddings in the target network. Both GraphSAGE and GCN have better results compared with DeepWalk and node2vec because of the proper utilization of both node attributes and network topology in learning tasks and a certain level of knowledge transfer due to the shared weights in the representation learner. AdaGCN consistently outperforms GCN across all learning tasks by a large margin, which can be attributed to the successful knowledge transfer from the source to the target network thanks to the domain adaptation component. Similarly, AdaIGCN further improves over AdaGCN with 2.40% and 2.04% relative gains in Micro-F1 and Macro-F1 scores respectively because of the improved GCN layer for alleviating overfitting. It also produces 3.14% and 3.55% relative improvements in Micro-F1 and Macro-F1 scores respectively over the best baseline CDNE.
Overall, the empirical results demonstrate that our proposed methods achieve state-of-the-art cross-network node classification performance in both the unsupervised and semi-supervised settings, and thus can potentially be applied to a wide range of scenarios.
5.3 Effect of Training Rate (RQ2)
In this section, we study the effect of training rate of the source and target networks on model performance.
5.3.1 Effect of Source Training Rate
We conduct experiments with the training rate of the source network ranging from 5% to 90% while the target network is completely unlabeled. The experimental results are displayed in Figure 3. Note that only some of the baselines are selected for comparison for clarity, and only the results on tasks with DBLPv7 and Citationv1 as targets are presented here to avoid repetition. We have the following observations:
Our proposed methods, including AdaGCN and AdaIGCN, consistently outperform all the baselines on these four tasks for all training rates, which demonstrates their effectiveness for knowledge transfer across networks. AdaIGCN performs better than AdaGCN, especially when the source training rate is low. It validates that the utilization of IGCN layer can help alleviate the overfitting issue and facilitate knowledge transfer.
For almost all baselines except DeepWalk, the performance first improves and then becomes stable as the source training rate increases. Our proposed AdaIGCN shows remarkably good performance even with only 5% labeled nodes in the source network, which suggests its high label efficiency.
We noticed that the performance of DeepWalk decreases as the source training rate increases. This further confirms our finding that single network embedding methods based on topology only are not applicable to cross-network learning due to the incomparable node representations of the two networks. Similar results can also be observed for node2vec, which are not shown here.
5.3.2 Effect of Target Training Rate
We investigate the effect of target training rate by varying it from 1% to 10% while fixing the source training rate as 10%. We only show the Micro-F1 scores in Figure 4 on learning tasks Citationv1→DBLPv7 and DBLPv7→Citationv1 for succinct presentation. We have the following observations. Firstly, AdaGCN significantly and consistently outperforms GCN on both learning tasks for all target training rates, which means that the adversarial domain adaptation component can successfully mitigate the distribution discrepancy between two domains and help knowledge transfer across networks. Specifically, AdaGCN exhibits a 5.08% relative improvement over GCN on average. Secondly, AdaIGCN consistently achieves further improvements upon AdaGCN, and the gap is more significant with low target training rates. In particular, it produces a 3.90% relative gain over AdaGCN on Citationv1→DBLPv7 when the target training rate is 1%. This suggests that the improved GCN layer strikes a good balance between the strength of Laplacian smoothing and model complexity. Overall, it demonstrates the effectiveness of our proposed methods for network transfer learning in the semi-supervised setting.
5.4 Effect of Distribution Discrepancy (RQ3)
In this section, we explore the effect of distribution discrepancy between the source and target networks on domain adaptation. We define the common attribute rate between the source and target networks as , and we randomly delete some of the common attributes of the two networks to change this rate in the experiments. A lower common attribute rate means a larger distribution discrepancy. We conduct multi-label classification on tasks with Citationv1 as the target under varying common attribute rates in the unsupervised setting where the source training rate is 10%, and compare the performance of GCN, AdaGCN and AdaIGCN.
Figure 5 displays the experimental results when the common attribute rate ranges from 10% to 50%. Both AdaGCN and AdaIGCN consistently outperform GCN across all common attribute rates for both transfer tasks. More specifically, AdaGCN achieves 22.88% and 24.93% relative gains on Micro-F1 and Macro-F1 scores respectively over GCN for DBLPv7→Citationv1, and 9.02% and 14.56% for ACMv9→Citationv1. It demonstrates that the adversarial domain adaptation component contributes to the classification performance even when the source and target networks only share a very small proportion of attributes. Besides, AdaIGCN performs better than AdaGCN consistently, which further confirms that the IGCN layer can learn better node representations for domain adaptation. In summary, the proposed methods are robust and work well under large distribution discrepancy between the source and target networks, which enables their application to a wide range of real-world problems.
5.5 Effect of Graph Convolution (RQ4)
In this section, we vary the smoothing parameter of the IGCN layer in AdaIGCN from 1 to 25 to study the effect of graph convolution on domain adaptation. Note that AdaIGCN reduces to WDGRL when the smoothing parameter is 0, i.e., no smoothing on node features. The experiments are conducted in the unsupervised setting with source training rate as 10%. We present the experimental results on DBLPv7→Citationv1 in Figure 6(a). We can observe that graph convolution on node features brings substantial improvements to node classification performance on the target network, since there is a 26.62% relative improvement when the smoothing parameter increases from 0 to 1. When it varies from 1 to 25, the classification accuracy first increases and then slightly drops. It shows that an appropriate setting of the smoothing parameter can further facilitate knowledge transfer, but too large a value can result in over-smoothing of node features and thus harm the transfer performance. Specifically, features of neighboring nodes become similar with Laplacian smoothing in the graph convolutional layer, and a large smoothing parameter can make them converge to very similar values and blur the class boundaries. On the whole, graph convolution plays a crucial role in the successful knowledge transfer across networks in our proposed framework.
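The over-smoothing effect can be illustrated on a toy graph. The sketch below assumes that the smoothing parameter acts as the number of times a renormalized adjacency filter is applied to the node features; this is one plausible reading rather than a definition taken from this section, and the function names, the 4-node cycle graph and the feature values are all illustrative.

```python
import math


def smooth_k_times(adj, x, k):
    """Apply the symmetrically normalized (self-loop) adjacency filter k times."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    filt = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    for _ in range(k):
        x = [sum(filt[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


def class_gap(x, class_a, class_b):
    """Absolute difference between the mean feature values of two classes."""
    mean_a = sum(x[i] for i in class_a) / len(class_a)
    mean_b = sum(x[i] for i in class_b) / len(class_b)
    return abs(mean_a - mean_b)


# Cycle 0 - 1 - 2 - 3 - 0; nodes 0 and 1 are one class, nodes 2 and 3 the other.
cycle = [[0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0]]
signal = [1.0, 1.0, 0.0, 0.0]
gaps = {k: class_gap(smooth_k_times(cycle, signal, k), [0, 1], [2, 3])
        for k in (0, 1, 5, 25)}
```

On this graph the gap between the two class means shrinks by a factor of three with every application of the filter, so at 25 applications the two classes are numerically indistinguishable. The toy signal is noise-free, so it shows only the harmful side of heavy smoothing, not the benefit of moderate smoothing.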
5.6 Parameter Sensitivity (RQ5)
In this section, we perform sensitivity analysis of AdaGCN on the domain adaptation coefficient, the gradient penalty coefficient, and the domain critic training step. The experiments are conducted in the unsupervised setting with source training rate as 10%. This analysis is expected to shed some light on how to configure these hyper-parameters. Here we only present the Micro-F1 score for Citationv1→DBLPv7 to avoid repetition, and similar tendencies can be observed in other tasks. Note that when studying one hyper-parameter, we fix all others at the default settings mentioned in Section 5.1.3.
The domain adaptation coefficient balances the semi-supervised loss and the domain adaptation loss. We can find that the performance slightly improves as the coefficient increases from 0.4 to 1.2, and then drops quickly afterwards as shown in Figure 6(b). It suggests that it is important to maintain the balance between the two parts so as to learn both class discriminative and domain invariant representations. The gradient penalty coefficient controls the weight of the gradient penalty when training the discriminator of the adversarial domain adaptation component. From Figure 6(c), it can be observed that the best result is obtained when it is set to 10, and smaller or larger values might result in performance degradation, which is consistent with earlier findings, and thus 10 would be a recommended setting. Theoretically, the domain critic network should be trained to optimality by optimizing its own parameters while fixing those of other components, and thus the training step should be set to a large enough number for this purpose. From Figure 6(d), it can be noticed that the Micro-F1 score shows an apparent increase when the training step increases from 5 to 10, and then becomes stable, which is consistent with our theoretical analysis.
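The roles of the domain critic training step and the gradient penalty coefficient can be sketched as follows. This is a schematic of the alternating schedule only, not the original training code: `update_critic` and `update_model` are hypothetical callables standing in for the real optimizer steps, and the penalty function uses the standard form from the improved Wasserstein GAN training procedure, applied here to precomputed gradient norms.

```python
def gradient_penalty(grad_norms, coefficient=10.0):
    """Penalty pushing critic gradient norms toward 1.

    grad_norms: gradient norms of the critic at interpolated points,
    assumed to be computed elsewhere by an autodiff framework.
    """
    return coefficient * sum((g - 1.0) ** 2 for g in grad_norms) / len(grad_norms)


def train_adversarially(num_iterations, critic_steps, update_critic, update_model):
    """Alternate several critic updates with one update of the other components."""
    for _ in range(num_iterations):
        # Train the critic toward optimality while the rest of the model is fixed.
        for _ in range(critic_steps):
            update_critic()
        # Then update the representation learner and classifier once.
        update_model()


# Usage with counting stubs in place of real optimizer steps.
calls = {"critic": 0, "model": 0}


def fake_critic_step():
    calls["critic"] += 1


def fake_model_step():
    calls["model"] += 1


train_adversarially(3, 10, fake_critic_step, fake_model_step)
```

With the default setting of 10 critic steps, the critic is updated ten times as often as the rest of the model, which is what allows it to approximate the optimal critic before each update of the representation learner.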
5.7 Visualization of Node Representations
Figure 7 visualizes the node representations generated by GCN, IGCN, AdaGCN and AdaIGCN in the unsupervised setting for ACMv9→Citationv1 using t-SNE, where the source network is fully labeled. We only visualize nodes from “Databases” and “Computer Vision” for clear presentation. The gray and orange points represent papers of “Databases” and “Computer Vision” respectively from ACMv9, while red and green points represent papers of “Databases” and “Computer Vision” from Citationv1.
On one hand, the domain adaptation component helps mitigate domain divergence and benefits knowledge transfer. Specifically, from Figures 7(a) and 7(b), it can be observed that both the GCN and IGCN models suffer from distribution shift between different networks, since nodes from different categories, e.g., green and gray points, are mixed together. In contrast, from Figures 7(c) and 7(d), we can find that gray and red points are clustered together, while orange and green points are clustered together. It demonstrates that the adversarial domain adaptation successfully mitigates the distribution divergence between the source and target networks, since papers from the same categories of both domains are well clustered together. Besides, the boundary between these two clusters is quite clear, which means that the learned node representations are discriminative. On the other hand, the IGCN layer also brings two significant advantages. Firstly, the IGCN layer allows adjusting the smoothing strength on node features without increasing model complexity, and an appropriate smoothing of node features helps to learn more compact node representations within the same category as shown in Figures 7(b) and 7(d), and thus contributes to the classification task. Furthermore, it makes the domain adaptation process easier, which is confirmed by the visualization results that AdaIGCN aligns the source and target node representations better than AdaGCN as shown in Figures 7(c) and 7(d).
In this paper, we successfully address the cross-network node classification problem by proposing a novel network transfer learning framework AdaGCN, which leverages the techniques of adversarial domain adaptation and graph convolution. It can learn both class discriminative and network invariant node representations with the help of a semi-supervised learning (SSL) component and an adversarial domain adaptation (ADA) component. The SSL component is capable of learning a well-generalized node classifier with graph convolutional layers for representation learning, while the ADA component ensures successful knowledge transfer from the source network to the target network through adversarial learning. Together they enable AdaGCN to work well in real-world attributed networks under a realistic setting.
Research on transfer learning for networked data is still at an early stage, and much more effort is needed. This paper serves as a further step in this direction. Future work will include investigating knowledge transfer from multiple source networks to a target network and exploring conditional adversarial domain adaptation for better alignment of multimodal data distributions.
This research was partially supported by HK ITF UIM/363 and the grants 1-ZVJJ and G-YBXV funded by the Hong Kong Polytechnic University.
-  S. Bhagat, G. Cormode, and S. Muthukrishnan, “Node classification in social networks,” in Social Network Data Analytics, 2011, pp. 115–148.
-  J. Shu, Z. Xu, and D. Meng, “Small sample learning in big data era,” CoRR, vol. abs/1808.04572, 2018.
-  M. Fang, J. Yin, and X. Zhu, “Transfer learning across networks for collective classification,” in ICDM, 2013, pp. 161–170.
-  X. Shen and F. Chung, “Network embedding for cross-network node classification,” in arXiv:1901.07264, 2019.
-  B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: online learning of social representations,” in KDD, 2014, pp. 701–710.
-  J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE: large-scale information network embedding,” in WWW, 2015, pp. 1067–1077.
-  A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in KDD, 2016, pp. 855–864.
-  M. Heimann and D. Koutra, “On generalizing neural node embedding methods to multi-network problems,” in KDD MLG Workshop, 2017.
-  Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in ICML, 2016, pp. 40–48.
-  T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
-  W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017.
-  P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” CoRR, 2017.
-  J. Ni, S. Chang, X. Liu, W. Cheng, H. Chen, D. Xu, and X. Zhang, “Co-regularized deep multi-network embedding,” in WWW, 2018, pp. 469–478.
-  L. Xu, X. Wei, J. Cao, and P. S. Yu, “Embedding of embedding (EOE): joint embedding for coupled heterogeneous networks,” in WSDM, 2017, pp. 741–749.
-  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
-  M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
-  Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky, “Domain-adversarial training of neural networks,” JMLR, vol. 17, pp. 59:1–59:35, 2016.
-  J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein distance guided representation learning for domain adaptation,” in AAAI, 2018, pp. 4058–4065.
-  Q. Li, X. Wu, H. Liu, X. Zhang, and Z. Guan, “Label efficient semi-supervised learning via graph filtering,” in CVPR, 2019.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
-  S. Cao, W. Lu, and Q. Xu, “Grarep: Learning graph representations with global structural information,” in CIKM, 2015, pp. 891–900.
-  X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, “Community preserving network embedding,” in AAAI, 2017, pp. 203–209.
-  S. Cao, W. Lu, and Q. Xu, “Deep neural networks for learning graph representations,” in AAAI, 2016, pp. 1145–1152.
-  D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in KDD, 2016, pp. 1225–1234.
-  X. Shen and F. Chung, “Deep network embedding with aggregated proximity preserving,” in ASONAM, 2017, pp. 40–43.
-  ——, “Deep network embedding for graph representation learning in signed networks,” IEEE Transactions on Cybernetics, 2018.
-  Q. Dai, Q. Li, J. Tang, and D. Wang, “Adversarial network embedding,” in AAAI, 2018.
-  S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang, “Adversarially regularized graph autoencoder for graph embedding,” in IJCAI, 2018.
-  Q. Dai, X. Shen, L. Zhang, Q. Li, and D. Wang, “Adversarial training methods for network embedding,” in WWW, 2019, pp. 329–339.
-  Z. Zhang, H. Yang, J. Bu, S. Zhou, P. Yu, J. Zhang, M. Ester, and C. Wang, “ANRL: attributed network representation learning via deep neural networks,” in IJCAI, 2018, pp. 3155–3161.
-  L. Xu, X. Wei, J. Cao, and P. S. Yu, “On exploring semantic meanings of links for embedding social networks,” in WWW, 2018, pp. 479–488.
-  C. Tu, W. Zhang, Z. Liu, and M. Sun, “Max-margin deepwalk: Discriminative learning of network representation,” in IJCAI, 2016.
-  S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep network representation,” in IJCAI, 2016, pp. 1895–1901.
-  X. Huang, J. Li, and X. Hu, “Label informed attributed network embedding,” in WSDM, 2017, pp. 731–739.
-  J. Liang, P. Jacobs, J. Sun, and S. Parthasarathy, “Semi-supervised embedding in attributed networks with outliers,” in SDM, 2018.
-  L. Liu, W. K. Cheung, X. Li, and L. Liao, “Aligning users across social networks using network embedding,” in IJCAI, 2016, pp. 1774–1780.
-  M. Heimann, H. Shen, T. Safavi, and D. Koutra, “REGAL: representation learning-based graph alignment,” in CIKM, 2018, pp. 117–126.
-  J. Tang, T. Lou, and J. M. Kleinberg, “Inferring social ties across heterogenous networks,” in WSDM, 2012, pp. 743–752.
-  X. Shen, F. Chung, and S. Mao, “Leveraging cross-network information for graph sparsification in influence maximization,” in SIGIR, 2017, pp. 801–804.
-  ——, “Cross-network learning with fuzzy labels for seed selection and graph sparsification in influence maximization,” IEEE Transactions on Fuzzy Systems, 2019.
-  J. Ye, H. Cheng, Z. Zhu, and M. Chen, “Predicting positive and negative links in signed social networks by transfer learning,” in WWW, 2013.
-  J. Lee, H. Kim, J. Lee, and S. Yoon, “Transfer learning for deep learning on graph-structured data,” in AAAI, 2017, pp. 2154–2160.
-  B. Tan, Y. Zhang, S. J. Pan, and Q. Yang, “Distant domain transfer learning,” in AAAI, 2017, pp. 2604–2610.
-  A. Rozantsev, M. Salzmann, and P. Fua, “Beyond sharing weights for deep domain adaptation,” CoRR, vol. abs/1603.06432, 2016.
-  M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable features with deep adaptation networks,” in ICML, 2015, pp. 97–105.
-  B. Sun and K. Saenko, “Deep CORAL: correlation alignment for deep domain adaptation,” in ECCV, 2016, pp. 443–450.
-  E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” CoRR, vol. abs/1412.3474, 2014.
-  F. Zhuang, X. Cheng, P. Luo, S. J. Pan, and Q. He, “Supervised representation learning: Transfer learning with deep autoencoders,” in IJCAI, 2015, pp. 4119–4125.
-  T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in ICML, 2017, pp. 1857–1865.
-  E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in CVPR, 2017, pp. 2962–2971.
-  Z. Pei, Z. Cao, M. Long, and J. Wang, “Multi-adversarial domain adaptation,” in AAAI, 2018, pp. 3934–3941.
-  M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adversarial domain adaptation,” in NeurIPS, 2018, pp. 1647–1657.
-  S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” in NIPS. MIT Press, 2006, pp. 137–144.
-  S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine Learning, vol. 79, no. 1-2, pp. 151–175, 2010.
-  X. Zhang, H. Liu, Q. Li, and X. Wu, “Attributed graph clustering via adaptive graph convolution,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, 2019, pp. 4327–4333.
-  W. Fan, Y. Ma, Q. Li, Y. He, Y. E. Zhao, J. Tang, and D. Yin, “Graph neural networks for social recommendation,” in WWW, 2019, pp. 417–426.
-  D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
-  A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs,” IEEE Trans. Signal Processing, vol. 61, no. 7, pp. 1644–1656, 2013.
-  Q. Li, Z. Han, and X. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI, 2018, pp. 3538–3545.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in ICML, 2017, pp. 214–223.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in NIPS, 2017.
-  C. Villani, Optimal transport: old and new. Springer-Verlag Berlin Heidelberg, 2008, vol. 338.
-  J. Chen, J. Zhu, and L. Song, “Stochastic training of graph convolutional networks with variance reduction,” in ICML, vol. 80, 2018, pp. 941–949.
-  J. Chen, T. Ma, and C. Xiao, “Fastgcn: Fast learning with graph convolutional networks via importance sampling,” in ICLR, 2018.
-  J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: extraction and mining of academic social networks,” in KDD, 2008, pp. 990–998.
-  L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” JMLR, vol. 9, pp. 2579–2605, 2008.