1. Introduction
Graphs are widely used to capture rich information (e.g., hierarchical structures, communities) in data from various domains such as social networks, e-commerce networks, knowledge graphs, the WWW, and the semantic web. By incorporating graph topology and node/edge features into machine learning models,
graph representation learning has achieved great success in many important applications such as node classification, link prediction, and graph clustering. A large number of graph representation learning algorithms (Velickovic et al., 2019; Peng et al., 2020; Tang et al., 2015; Perozzi et al., 2014; Grover and Leskovec, 2016; Ahmed et al., 2013; Cao et al., 2015; Qiu et al., 2018; Chen et al., 2018; Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019; Qu et al., 2019) have been proposed. Among them, many (Tang et al., 2015; Perozzi et al., 2014; Grover and Leskovec, 2016; Hamilton et al., 2017) are designed in an unsupervised manner and make use of “negative sampling” to learn node representations. This design shares similar ideas with contrastive learning (He et al., 2020; Tian et al., 2020, 2019; van den Oord et al., 2018; Hénaff et al., 2019; Belghazi et al., 2018; Hjelm et al., 2019; Wu et al., 2018; Mikolov et al., 2013; Asano et al., 2020; Caron et al., 2018; Chen et al., 2020), which “contrasts” the similarities of the representations of similar (or positive) node pairs against those of negative pairs. These algorithms adopt the noise contrastive estimation loss (NCE loss), while they differ in the definition of node similarity (hence the design of contrastive pairs) and in the encoder design.
Existing graph representation learning algorithms mainly fall into three categories: adjacency matrix factorization based models (Ahmed et al., 2013; Cao et al., 2015; Qiu et al., 2018), skip-gram based models (Perozzi et al., 2014; Grover and Leskovec, 2016; Chen et al., 2018), and graph neural networks (GNNs) (Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019; Qu et al., 2019). We focus on GNNs because GNN models can not only capture graph topology information like the skip-gram and factorization based models, but also incorporate node/edge features. Specifically, we formulate a contrastive GNN framework with four components: a similarity definition, a GNN encoder, a contrastive loss function, and possibly a downstream classification task.
While graph representation algorithms using the “negative sampling” approach have been shown to achieve good performance empirically, there is a lack of theoretical analysis of their generalization performance. In addition, we also found that directly applying contrastive pairs and the NCE loss to existing GNN models, e.g., graph convolutional networks (GCNs) (Kipf and Welling, 2017), does not always work well (as shown in Section 7). In order to understand the generalization performance of these algorithms and find out when the direct application of contrastive pairs and the NCE loss does not work, we derive a generalization bound for our contrastive GNN framework using the theoretical framework proposed in (Saunshi et al., 2019). Our generalization bound reveals that high scales of the node representations’ norms and high variance among them are two main factors that hurt the generalization performance.
To address the problems caused by these two factors, we propose a novel regularization method, ContrastReg. ContrastReg uses a regularization vector, which is a random vector with each element sampled from a fixed range. We learn a graph representation model by forcing the representations of all nodes to be similar to this vector and all the contrastive representations calculated by shuffling node features to be dissimilar to it. We show from a geometric perspective that ContrastReg stabilizes the scales of the representation norms and reduces their variance. We also validate by experiments that ContrastReg significantly improves the quality of node representations for a popular GNN model using different similarity definitions.

2. Related work
Graph representation learning. Many graph representation learning models have been proposed. Factorization-based models (Ahmed et al., 2013; Cao et al., 2015; Qiu et al., 2018) factorize an adjacency matrix to obtain node representations. Random walk based models such as DeepWalk (Perozzi et al., 2014) sample node sequences as the input to skip-gram models to compute the representation for each node. Node2vec (Grover and Leskovec, 2016) balances depth-first and breadth-first random walks when it samples node sequences. HARP (Chen et al., 2018) compresses nodes into supernodes to obtain a hierarchical graph that provides hierarchical information to the random walks. GNN models (Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019; Qu et al., 2019) have shown great capability in capturing both graph topology and node/edge feature information. Most GNN models follow a neighborhood aggregation schema, in which each node receives and aggregates the information from its neighbors in each GNN layer, i.e., for the $k$-th layer, $a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{h_u^{(k-1)} : u \in \mathcal{N}(v)\}\big)$ and $h_v^{(k)} = \mathrm{COMBINE}^{(k)}\big(h_v^{(k-1)}, a_v^{(k)}\big)$. This work proposes a regularization method, ContrastReg, for GNN models and improves their performance on downstream tasks.
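As a concrete (if simplified) illustration of this schema, the following PyTorch sketch implements one mean-aggregation GNN layer; the class and tensor names are ours, and it is not the exact layer used by any of the cited models.

```python
import torch
import torch.nn as nn

class MeanAggLayer(nn.Module):
    """One GNN layer: aggregate neighbor messages, then combine with the node's own state."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin_self = nn.Linear(in_dim, out_dim)
        self.lin_neigh = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # h: [N, in_dim] node states from the previous layer
        # adj: [N, N] dense adjacency matrix (1 if edge, else 0)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)    # avoid division by zero
        neigh = adj @ h / deg                              # AGGREGATE: mean over neighbors
        return torch.relu(self.lin_self(h) + self.lin_neigh(neigh))  # COMBINE

# usage: h1 = MeanAggLayer(16, 32)(h0, adj) for a graph with node features h0
```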
Contrastive learning.
Contrastive learning is a self-supervised learning method that learns representations by contrasting positive pairs against negative pairs. Contrastive pairs can be constructed in different ways for different types of data and tasks, such as multi-view (Tian et al., 2020, 2019), target-to-noise (van den Oord et al., 2018; Hénaff et al., 2019), mutual information (Belghazi et al., 2018; Hjelm et al., 2019), instance discrimination (Wu et al., 2018), context co-occurrence (Mikolov et al., 2013), clustering (Asano et al., 2020; Caron et al., 2018), and multiple data augmentations (Chen et al., 2020). In addition to the above unsupervised learning settings, contrastive methods (Tian et al., 2019; Chen et al., 2020; van den Oord et al., 2018) also show great success in capturing information that can be transferred to new tasks from different domains.

Contrastive learning has been successfully applied in many graph representation learning models such as (Perozzi et al., 2014; Grover and Leskovec, 2016; Tang et al., 2015; Velickovic et al., 2019; Peng et al., 2020; Hamilton et al., 2017). In this work, we apply contrastive learning to GNNs and propose a contrastive GNN framework. State-of-the-art models such as (Velickovic et al., 2019; Peng et al., 2020), which both use the GCN encoder, can be seen as special instances of this algorithmic framework. We show that with the same GCN encoder and similar contrastive pair designs, our models can significantly outperform (Velickovic et al., 2019; Peng et al., 2020) by adopting ContrastReg.
Noise Contrastive Estimation loss.
The NCE loss was originally proposed to reduce the computation cost of estimating the parameters of a probabilistic model, using logistic regression to discriminate between the observed data and artificially generated noise (Gutmann and Hyvärinen, 2012; Dyer, 2014; Mnih and Teh, 2012). It has been successfully applied to contrastive learning (Tian et al., 2020, 2019; van den Oord et al., 2018; Hénaff et al., 2019; Belghazi et al., 2018; Hjelm et al., 2019; Wu et al., 2018; Chen et al., 2020). There are works aiming to explain the success of the NCE loss. Gutmann and Hyvärinen (2012) proved that when the NCE loss serves as the objective function of the parametric density estimation problem, the estimated parameters converge in probability to the optimal estimate. Yang et al. (2020b) showed that when the NCE loss is applied in graph representation learning, the mean squared error of the similarity between two nodes is related to the negative sampling strategy. However, their definitions of optimal representations or optimal parameters do not consider downstream tasks, but are based on predefined structures. In this paper, we adopt the theoretical settings from the contrastive learning framework proposed by Saunshi et al. (2019) to analyze the generalizability of the NCE loss in downstream tasks, i.e., linear classification tasks. To the best of our knowledge, we are the first to theoretically analyze the NCE loss under this contrastive learning setting (Saunshi et al., 2019).

Node-level similarity. The NCE loss has been adopted in many graph representation learning models to capture different types of node-level similarity. We characterize them as follows:

Structural similarity: We may capture structural similarity from different angles. From the graph theory perspective, GraphWave (Donnat et al., 2018) leverages the diffusion of spectral graph wavelets to capture structural similarity, and struc2vec (Ribeiro et al., 2017) uses a hierarchy to measure node similarity at different scales. From the induced subgraph perspective, GCC (Qiu et al., 2020) treats the induced subgraphs of the same ego network as similar pairs and those from different ego networks as dissimilar pairs. To capture community structure, vGraph (Sun et al., 2019) exploits the high correlation between community detection and node representations to make node representations carry more community structure information. To capture global-local structure, DGI (Velickovic et al., 2019) maximizes the mutual information between node representations and graph representations to allow node representations to contain more global information.
We will show that ContrastReg facilitates the contrastive training of graph representation learning models regardless of the different designs of contrastive pairs being used, and thus is helpful in capturing all types of similarities.
Regularization for graph representation learning. In addition to the general regularization terms used in machine learning such as L1/L2 regularization, there are regularizers proposed specifically for graph representation learning models. GraphAT (Feng et al., 2019) and BVAT (Deng et al., 2019) add adversarial perturbations to the input data as a regularizer to obtain more robust models. GraphSGAN (Ding et al., 2018) generates fake input data in low-density regions by using a generative adversarial network as a regularizer. P-reg (Yang et al., 2020a) makes use of the smoothness property of real-world graphs to improve GNN models. Most of the above regularizers are designed for supervised tasks and perform well only in supervised settings, while ContrastReg is the first regularizer designed for contrastive learning and achieves excellent performance in unsupervised settings. We will also show its advantages over traditional regularizers, e.g., weight decay (L2 regularization of the model parameters) (Loshchilov and Hutter, 2019), in Section 5.

3. Preliminaries
We briefly discuss the theoretical framework (Saunshi et al., 2019) for contrastive learning and the NCE loss, which is the foundation of Section 4.
3.1. Concepts in Contrastive Learning
Consider a feature space $\mathcal{X}$. The goal of contrastive learning is to train an encoder $f$ for all input data points by constructing positive pairs $(x, x^+)$ and negative pairs $(x, x^-)$. To formally analyze the behavior of contrastive learning, Saunshi et al. (2019) introduce the following concepts.

Latent classes: Data are considered as drawn from a set of latent classes $\mathcal{C}$ with distribution $\rho$ over $\mathcal{C}$. Further, a distribution $\mathcal{D}_c$ is defined over the feature space $\mathcal{X}$ for each class $c \in \mathcal{C}$ to measure the relevance between a data point $x$ and the class $c$.

Semantic similarity: Positive samples are drawn from the same latent class, with distribution
$$\mathcal{D}_{\mathrm{sim}}(x, x^+) = \mathbb{E}_{c\sim\rho}\,\mathcal{D}_c(x)\,\mathcal{D}_c(x^+), \qquad (1)$$
while negative samples are drawn randomly from all possible data points, i.e., the marginal of $\mathcal{D}_{\mathrm{sim}}$, as
$$\mathcal{D}_{\mathrm{neg}}(x^-) = \mathbb{E}_{c\sim\rho}\,\mathcal{D}_c(x^-). \qquad (2)$$
Supervised tasks: Denote $k$ as the number of negative samples. The object of the supervised task, i.e., the feature-label pair $(x, c)$, is sampled by first drawing $k+1$ classes from $\rho$, then drawing the label $c$ uniformly from these classes and $x \sim \mathcal{D}_c$.
Mean classifier: The mean classifier, whose $c$-th row is the class mean $\mu_c := \mathbb{E}_{x\sim\mathcal{D}_c}[f(x)]$, is naturally imposed to bridge the gap between the representation learning performance and the linear separability of the learned representations.
Empirical Rademacher complexity: Given a sample $S = \{x_1, \dots, x_M\}$, the empirical Rademacher complexity of a function class $\mathcal{F}$ is defined as
$$\mathcal{R}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\Big[\sup_{f\in\mathcal{F}} \langle \sigma, f_{|S}\rangle\Big],$$
where $f_{|S} = \big(f(x_1), \dots, f(x_M)\big)$ and $\sigma$ are independent Rademacher random variables taking values uniformly from $\{-1, +1\}$.
In addition, the theoretical framework in (Saunshi et al., 2019) makes an assumption: the encoder $f$ is bounded, i.e., $\|f(x)\| \le R$ for all $x \in \mathcal{X}$.
3.2. Contrastive Learning with NCE Loss
The contrastive loss defined by Saunshi et al. (2019) is
$$L_{\mathrm{un}}(f) = \mathbb{E}_{(x,x^+)\sim\mathcal{D}_{\mathrm{sim}},\, x^-\sim\mathcal{D}_{\mathrm{neg}}}\Big[\ell\big(f(x)^\top\big(f(x^+) - f(x^-)\big)\big)\Big],$$
where $\ell$ can be the hinge loss $\ell(v) = \max(0, 1-v)$ or the logistic loss $\ell(v) = \log(1 + e^{-v})$. Its supervised counterpart is defined analogously, with the class means $\mu_c$ of the downstream task in place of the sampled positives and negatives.
A more powerful loss function, the NCE loss, used in (Velickovic et al., 2019; Yang et al., 2020b; Mnih and Teh, 2012; Dyer, 2014), can be framed as
$$L_{\mathrm{NCE}}(f) = \mathbb{E}_{(x, x^+)\sim\mathcal{D}_{\mathrm{sim}},\, x^-\sim\mathcal{D}_{\mathrm{neg}}}\Big[-\log \sigma\big(f(x)^\top f(x^+)\big) - \log \sigma\big(-f(x)^\top f(x^-)\big)\Big], \qquad (3)$$
and its empirical counterpart with $M$ samples is given as
$$\hat{L}_{\mathrm{NCE}}(f) = \frac{1}{M}\sum_{i=1}^{M}\Big[-\log \sigma\big(f(x_i)^\top f(x_i^+)\big) - \log \sigma\big(-f(x_i)^\top f(x_i^-)\big)\Big], \qquad (4)$$
where $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid function.
Its supervised counterpart is exactly the cross-entropy loss for the $(k+1)$-way multi-class classification task:
(5)
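For concreteness, the following is a minimal PyTorch sketch of the empirical NCE loss in Eq. (4) with one negative sample per anchor; the function name and the assumption that representations of the seed nodes, positives, and negatives are already computed are ours.

```python
import torch
import torch.nn.functional as F

def nce_loss(h, h_pos, h_neg):
    """Empirical NCE loss with one negative per anchor.

    h, h_pos, h_neg: [M, d] representations of seed nodes, their positive
    samples, and their negative samples.
    """
    pos_logit = (h * h_pos).sum(dim=1)   # f(x)^T f(x+)
    neg_logit = (h * h_neg).sum(dim=1)   # f(x)^T f(x-)
    # -log sigmoid(pos) - log sigmoid(-neg), averaged over the M samples
    return -(F.logsigmoid(pos_logit) + F.logsigmoid(-neg_logit)).mean()
```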
4. Theoretical Analyses on NCE Loss
In this section, we first give the upper bound of the supervised loss when training a model using the NCE loss. Then we discuss the generalization bound of the NCE loss along with the generalization bounds of the hinge loss and the logistic loss, and show their limitations.
4.1. The Generalization Bound of NCE Loss
We give the generalization error of the function class on the unsupervised loss function in Theorem 4.1. Since we focus on regularization in contrastive learning, we give the result based on a single negative sample, i.e., $k = 1$.
Let $c^+, c^-$ be two classes sampled independently from the latent classes with distribution $\rho$, and let $\tau$ be the probability that $c^+$ and $c^-$ are the same class. Let $L^{=}_{\mathrm{NCE}}$ and $L^{\neq}_{\mathrm{NCE}}$ denote the NCE loss when the negative sample comes from the same class and from a different class, respectively. We have the following theorem.
Theorem 4.1 ().
, with probability at least ,
(6) 
where , , , , , and
Remark 0 ().
The above theorem tells us that if contrastive learning algorithms with the NCE loss can make the generalization term converge to 0 as the number of samples increases, the picked encoder will have good performance in downstream tasks. In other words, we can guarantee that contrastive learning algorithms obtain high-quality representations by minimizing the empirical NCE loss, under the condition that the generalization term converges to 0 with a large amount of data.
To prove Theorem 4.1, we first list some key lemmas.
Lemma 4.2 ().
For all ,
(7) 
This bound connects contrastive representation learning algorithms with their supervised counterparts. The lemma is obtained by Jensen’s inequality. The details are given in Appendix A.1.
Lemma 4.3 ().
With probability at least over the set , for all ,
(8) 
This bound guarantees that the encoder chosen by minimizing the empirical loss cannot be much worse than the best encoder in the function class. The proof applies the Rademacher complexity of the function class (Mohri et al., 2018) and the vector-contraction inequality (Maurer, 2016). More details are given in Appendix A.2.
Lemma 4.4 ().
.
This bound characterizes the loss caused by positive and negative samples that come from the same class, i.e., class collision. The proof uses Bernoulli’s inequality (details in Appendix A.3).
4.2. Discussion on the Generalization Bound
We now discuss the generalization error of the NCE loss in the contrastive learning setting.
4.2.1. Discussion on and
One term in Eq. (6) is the generalization error in terms of Rademacher complexity. It shows that when the encoder function is bounded and the number of samples is large enough, the encoder obtained by minimizing the empirical NCE loss comes with a performance guarantee. Note that when the loss is a bounded Lipschitz-continuous function and the encoder is bounded, the generalization errors of different contrastive losses differ only in the contraction rate, i.e., the Lipschitz constant.
The other term in Eq. (6) can be further rewritten in terms of the representation norms, which shows that the NCE loss is prone to be disturbed by large representation norms.
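To see this informally, the following toy computation (our own illustration, not part of the analysis) fixes the directions of an anchor, its positive, and its negative, and rescales only the representation norm; the NCE loss then changes by orders of magnitude with the scale alone, so the training signal becomes dominated by the norm rather than by the directions being contrasted.

```python
import numpy as np

def nce(h, h_pos, h_neg):
    """Per-sample NCE loss: -log sigmoid(h.h+) - log sigmoid(-h.h-)."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    return -np.log(sig(h @ h_pos)) - np.log(sig(-(h @ h_neg)))

rng = np.random.default_rng(0)
u = rng.normal(size=64); u /= np.linalg.norm(u)   # anchor/positive direction
v = rng.normal(size=64); v /= np.linalg.norm(v)   # negative direction

for s in (1.0, 5.0, 20.0):
    # same directions, different norms: the loss value varies drastically,
    # so the contrast between directions is swamped by the norm scale
    print(s, nce(s * u, s * u, s * v))
```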
4.2.2. Cases that make contrastive learning suboptimal
There are two cases where contrastive learning algorithms cannot guarantee that the learned encoder works in downstream tasks, as pointed out in (Saunshi et al., 2019), and they also apply to the NCE loss. Case 1. The encoder optimal for the downstream task can have a large unsupervised loss, and thus the algorithm fails, because of large spurious components in the representations that are orthogonal to the separation plane of the downstream task. Case 2. High intra-class deviation makes the unsupervised loss large even if both of its supervised counterpart losses are small, resulting in failure of the algorithm.
There is an additional case for the NCE loss (Case 3). The encoder optimal for the downstream task can have a large mean and a large variance of the representation norms, which lead to large terms in the bound of Eq. (6), even if it gives low intra-class deviation.
Example. Figure 1 depicts an example with , , , and and . In this example, the linear separability of is better than in both Figure 1(a) and 1(b), while . In the case of (cases 1 and 3), the contrastive learning algorithm using the NCE loss will converge to pick since and . When , will be chosen, since in Eq. (6) (case 3).
We remark that once we avoid the problems of case 3, the problems of cases 1 and 2 cannot be serious. For case 1, when the norms and their variance are not large, the length of the orthogonal projection is comparable to the scale of the separation plane, so the contrasting ability is not destroyed. For case 2, a mild norm scale and variance avoid the large intra-class deviation caused by a large variance of the representation norms.
We further show that case 3 is not an artificial case, but exists in practice. We use the training status of a model trained with only one contrastive loss computed from structural similarity on the Cora dataset (Yang et al., 2016) (Section 6) to demonstrate the issues with a high expectation and a high variance of the representation norms. Figure 2 shows the variance and the mean of the representation norms (left top), the ratio of to (right top), and (left bottom), and the testing accuracy (in a node classification task) during training for 300 epochs. The variance and the mean of the representation norms increase with the progression of epochs. This increases and significantly, while the ratio and the representation quality (indicated by the test accuracy) decrease. In the following sections, we use this ratio to measure the contrasting ability of models, i.e., the ability to contrast different classes.

5. Contrastive Regularization
The theoretical analysis in Section 4 shows that a good contrastive representation learning algorithm should satisfy the following conditions: 1. avoiding large representation norms; 2. avoiding a large norm variance; 3. preserving contrast.
We remark that the norm variance measures how far the norms of node representations are from their average value. It is different from the intra-class variance (Saunshi et al., 2019), which is the largest eigenvalue of the covariance matrix of the representations within a class. This is also the reason why case 2 is different from case 3 in the previous example.

In order to satisfy the above conditions, we propose a contrastive regularization term, ContrastReg:
(10) 
where the regularization vector is a random vector uniformly sampled from a fixed range, the bilinear weight is a trainable parameter, and the noisy features are generated from the input features. Different data augmentation techniques such as those in (Chen et al., 2020; Velickovic et al., 2019) can be applied to generate the noisy features. In Section 6, we will discuss how we calculate the noisy features in the GNN setting.
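Based on this description, a minimal sketch of the regularization term could look as follows; the exact form of Eq. (10), the sampling range of the random vector, and the module/tensor names are our assumptions rather than the definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastReg(nn.Module):
    """Regularizer that pulls clean representations toward a random vector r
    and pushes representations of shuffled (noisy) features away from it."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # trainable bilinear weight

    def forward(self, h, h_noisy):
        # h, h_noisy: [N, d] representations of clean and shuffled node features
        r = torch.rand(h.size(1), device=h.device) * 2 - 1   # fresh random vector each step (range assumed)
        target = self.W(r)                                   # W r
        pos = h @ target        # node representations should be similar to r
        neg = h_noisy @ target  # noisy representations should be dissimilar to r
        return -(F.logsigmoid(pos) + F.logsigmoid(-neg)).mean()

# noisy features can be obtained by shuffling rows of the feature matrix:
# x_noisy = x[torch.randperm(x.size(0))]
```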
We give the motivation of ContrastReg’s design as follows. Consider an artificial downstream task that learns a classifier to discriminate between the representations of regular data and those of noisy data. The loss in Eq. (10) can be viewed as the classification loss, and the trainable weight can be viewed as the parameter of a bilinear classifier. The classifier prefers the encoder that makes the representations of intra-class data points more condensed and the inter-class representations more separable. We use the GCN model on the Cora dataset (Yang et al., 2016) as an example. Figure 2(a) and Figure 2(b) show the t-SNE visualization of the representations before and after the optimization, respectively. We can observe that the learned representations are closer to each other (i.e., the range of the representations in Figure 2(a) is smaller than that in Figure 2(b)), while preserving the separability among the representations (i.e., the points with the same label share the same color).
5.1. Theoretical Guarantees for ContrastReg
Before stating the theorem, we give the following lemma to show that can be effectively reduced when is large by adding ContrastReg.
Lemma 5.1 ().
For a random variable , a constant and a constant , we have
(11) 
Proof.
First, we consider
where is strictly decreasing in and strictly increasing in , and is the solution of . Thus, we can approximate the range of by the fact that for all and .
Thus, for ,
and since is monotonically increasing, we get
When ,
Further, we assume that and are i.i.d. random variables sampled from ,
∎
Theorem 5.2 ().
Minimizing Eq. (10) induces the decrease in when .
Proof.
We minimize by gradient descent with learning rate .
(12) 
Eq. (12) shows that in every optimization step, extends by along . If we do an orthogonal decomposition of along and its unit orthogonal hyperplane , . Thus we have
(13) 
The projection of along is while the projection of plus the ContrastReg update along is
Note that .
when and , we have
(14) 
∎
Remark 1 ().
The condition of Eq. (14), , is not difficult to satisfy, since the magnitude of can be tuned. In practice, fits all our experiments.
5.2. Understanding the effects of ContrastReg
Theorem 5.2 shows that ContrastReg can reduce when is large, which is proved from the geometric perspective. Figure 4 visualizes the geometric interpretation of ContrastReg. In one gradient descent step, and are the representations before and after the gradient descent update of ContrastReg. For any data point pair and , we decompose and along and its orthogonal direction. Minimizing Eq. (10) consequently extends in each step along while preserving the length in the orthogonal direction. When we compare and , together with and , we conclude that .
Note that the mean and the variance are positively correlated. When we compare with , the former has higher norm and variance for . Theorem 5.2 shows that our ContrastReg can reduce when is large, and it should also prefer lower mean because the variance will increase when the norm is scaled to a larger value. Figure 5 shows that the mean and variance are reduced when we apply ContrastReg compared to only using one contrastive loss computed by structural similarity. Also, the representation quality is improved significantly.
We also discuss two questions regarding the effects of ContrastReg.
Does ContrastReg degrade the contrastive learning algorithm to a trivial solution, i.e., all representations converge to one point, even to the origin?
We highlight that Theorem 5.2 reduces rather than , and ContrastReg does not force all the representations into one point. From Figure 4, we know that adding ContrastReg only reduces the variance of the representations along the direction of , while preserving the differences along its orthogonal directions. Therefore, ContrastReg not only reduces , but also preserves the contrasting ability of the contrastive learning algorithm, as shown in Figure 2(b). Furthermore, as we randomize in each training step, the variance reduction on the representation norms is conducted along various directions. Thus, the representations will not share one dominant direction, and ContrastReg does not make the representations converge to one point.
Can other regularization/normalization methods, e.g., weight decay or final representation normalization, solve the norm problem? Other regularization and normalization techniques may help stabilize the mean of the representation norms and reduce their variance, but they cannot replace ContrastReg as ContrastReg leads to more stable changes in the representation norms and preserves high contrast between positive and negative samples.
We compare ContrastReg with weight decay and normalization and show their performance during a 300-epoch training process. Figure 6 shows the performance of weight decay and ContrastReg. From the left-top figure, we can see that ContrastReg gives smaller and more stable representation norms than weight decay. Specifically, although a large weight decay rate can be applied to obtain smaller representation norms, it leads to fluctuation. The fluctuation in the representation norms impairs the training process, as reflected by the fluctuation in the training losses (left-bottom figure). In addition, from the ratio , we know that ContrastReg preserves better contrasting ability.
Chen et al. (2020) show that adding L2 normalization (i.e., using cosine similarity rather than the inner product) with a temperature parameter improves the representation quality empirically. Figure 7 compares ContrastReg with L2 normalization. L2 normalization gives a much smaller ratio than ContrastReg, meaning that the separation between contrastive pairs provided by L2 normalization is not as clear as that provided by ContrastReg. This is because L2 normalization not only minimizes the variance of the representation norms, but also reduces the differences among the representations, rendering smaller contrast among the representations of data points in different classes. Thus, L2 normalization gives less improvement in the representation quality than ContrastReg.

6. A Contrastive GNN Framework
We present our contrastive GNN framework in Algorithm 1. Given a graph and node attributes, we train a GNN model for a number of epochs. Node representations can then be obtained from the trained model and used as the input to downstream tasks. We adopt the NCE loss as the contrastive loss in our framework. For each training epoch, we first select a seed node set for computing the NCE loss (Line 3). Then Line 4 constructs a positive sample and a negative sample for each node in the seed set, returning a set of 3-tuples consisting of the representations of the seed nodes, the positive samples, and the negative samples, respectively. After that, Line 5 computes the training loss by adding the NCE loss on these 3-tuples and the regularization loss calculated by ContrastReg, and Line 6 updates the model by backpropagation.
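A schematic version of this training loop is sketched below; the function names (select_nodes, select_pairs), the encoder signature, and the regularization weight are placeholders rather than the exact implementation, and nce_loss and ContrastReg refer to the sketches given earlier.

```python
import torch

def train(encoder, graph, features, select_nodes, select_pairs,
          contrast_reg, epochs=300, lr=0.01, reg_weight=1.0):
    """Schematic contrastive GNN training loop in the spirit of Algorithm 1."""
    params = list(encoder.parameters()) + list(contrast_reg.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for epoch in range(epochs):
        h = encoder(graph, features)                     # node representations
        seeds = select_nodes(graph, h, epoch)            # seed nodes for this epoch
        h_seed, h_pos, h_neg = select_pairs(graph, h, seeds)
        loss = nce_loss(h_seed, h_pos, h_neg)            # contrastive loss (Eq. 4)
        # noisy features: shuffle rows of the feature matrix, then re-encode
        x_noisy = features[torch.randperm(features.size(0))]
        loss = loss + reg_weight * contrast_reg(h, encoder(graph, x_noisy))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder
```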
As mentioned in Section 5, ContrastReg requires noisy features for contrastive regularization. In our contrastive GNN framework, we generate the noisy features by simply shuffling node features among nodes following the corruption function in (Velickovic et al., 2019).
For different node similarity definitions, different seed-selection functions are designed to select seed nodes that bring good training effects, and different pair-construction functions are designed to generate suitable contrastive pairs for the seed nodes. In the following, we demonstrate by example the designs of three contrastive GNN models for structure, attribute, and proximity similarity, respectively, which are also used in our experimental evaluation in Section 7.
6.1. Structure Similarity
We give an example model, LC, that captures the community structure inherent in graph data (Newman, 2006). As clustering is a common and effective method for detecting communities in a graph, we conduct clustering in the node representation space to capture community structures. LC borrows the design from (Huang et al., 2019) and implements local clustering in its pair-construction function and curriculum learning in its seed-selection function. We remark that other methods such as global clustering (Caron et al., 2018) and instance discrimination (Wu et al., 2018) can also be adapted into our contrastive GNN framework with different implementations of these two functions.
Algorithm 2 shows the implementation of the pair-construction and seed-selection functions in LC. For each seed node, the pair-construction function generates a positive node from the nodes that have the highest similarity scores with it, and a negative node randomly sampled from the remaining nodes (Lines 2-4). The seed-selection function selects nodes with the smallest entropy to avoid high randomness and uncertainty at the start of training. Every few epochs, it gradually adds more nodes with larger entropy to the contrastive loss as training progresses (Lines 11-13).
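As a rough illustration of this design (not the authors' exact Algorithm 2), the entropy-based seed selection and similarity-based pair construction could be sketched as follows; the soft cluster-assignment matrix, the curriculum schedule constants, and the candidate count are assumptions.

```python
import torch

def select_nodes_lc(assign_probs, epoch, warmup=20, grow_every=20, frac0=0.2):
    """Curriculum seed selection: start with low-entropy nodes, add more over time.

    assign_probs: [N, K] soft cluster assignments of the N nodes."""
    entropy = -(assign_probs * assign_probs.clamp(min=1e-12).log()).sum(dim=1)
    frac = min(1.0, frac0 + max(0, epoch - warmup) // grow_every * 0.1)
    k = max(1, int(frac * entropy.numel()))
    return entropy.argsort()[:k]            # indices of the k lowest-entropy nodes

def select_pairs_lc(h, seeds, num_candidates=5):
    """For each seed, pick a positive among its most similar nodes and a random negative."""
    sim = h[seeds] @ h.t()                                    # similarity scores to all nodes
    sim[torch.arange(len(seeds)), seeds] = float('-inf')      # exclude the seed itself
    top = sim.topk(num_candidates, dim=1).indices
    pos = top[torch.arange(len(seeds)), torch.randint(num_candidates, (len(seeds),))]
    neg = torch.randint(h.size(0), (len(seeds),))
    return h[seeds], h[pos], h[neg]
```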
6.2. Attribute Similarity
Models adopting attribute similarity assume that nodes with similar attributes are expected to have similar representations, so that the attribute information should be preserved. Hjelm et al. (2019); Peng et al. (2020) proposed contrastive pair designs to maximize the mutual information between low-level representations (the input features) and high-level representations (the learned representations). Algorithm 3 presents our model, ML, which adapts their multi-level representation design into our contrastive GNN framework.
In Algorithm 3, the seed-selection function selects all nodes in the graph as seeds. The pair-construction function uses the node itself as the positive node for each seed node, but in the returned 3-tuple, the representation of the node as the seed is different from its representation as the positive node. The second element in the 3-tuple is the node’s representation, while the first element is calculated by stacking an additional GNN layer on top of it. For negative nodes, the pair-construction function randomly samples a node from the graph for each seed node.
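A rough sketch of this pair construction (our own reading of Algorithm 3, with the extra GNN layer passed in as an arbitrary module) is given below.

```python
import torch

def select_pairs_ml(h, extra_layer, graph):
    """Multi-level pairs: contrast a node's higher-level representation against
    its own representation (positive) and a random node's representation (negative)."""
    n = h.size(0)
    h_high = extra_layer(graph, h)          # one more GNN layer stacked on h
    h_pos = h                               # the node itself, at the lower level
    h_neg = h[torch.randperm(n)]            # a randomly sampled node per seed
    return h_high, h_pos, h_neg
```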
6.3. Proximity Similarity
The assumption behind proximity similarity is that nodes are expected to have similar representations when they have high proximity (i.e., they are near neighbors). To capture proximity information among nodes, we implement the seed-selection and pair-construction functions following the setting of unsupervised GraphSAGE (Hamilton et al., 2017). Adjacent nodes are selected to be positive pairs, while negative pairs are sampled from non-adjacent nodes.
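For completeness, a minimal version of proximity-based pair construction (edges as positives, uniformly sampled nodes as negatives), in the spirit of unsupervised GraphSAGE, could look like the following; the tensor layout of edge_index and the number of sampled pairs are assumptions.

```python
import torch

def select_pairs_co(h, edge_index, num_pairs=1024):
    """edge_index: [2, E] tensor of (source, target) node indices."""
    e = torch.randint(edge_index.size(1), (num_pairs,))
    src, dst = edge_index[0, e], edge_index[1, e]              # adjacent nodes as positives
    neg = torch.randint(h.size(0), (num_pairs,))               # uniform negatives (may rarely hit a neighbor)
    return h[src], h[dst], h[neg]
```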
Table 1. Statistics of the datasets.
Dataset  Node #  Edge #  Feature #  Class #

Cora  2,708  5,429  1,433  7 
Citeseer  3,327  4,732  3,703  6 
Pubmed  19,717  44,338  500  3 
ogbn-arxiv  169,343  1,166,243  128  40 
Wiki  2,405  17,981  4,973  3 
Computers  13,381  245,778  767  10 
Photo  7,487  119,043  745  8 
ogbn-products  2,449,029  61,859,140  100  47 
Reddit  232,965  114,615,892  602  41 
Table 2. Node classification accuracy with and without ContrastReg.
Algorithm  Cora  Wiki  Computers  Reddit 

ML  73.22  58.70  77.08  94.33 
ML+reg  82.65  67.20  80.30  94.38 
LC  79.73  68.96  79.80  94.42 
LC+reg  82.33  69.19  80.89  94.43 
CO  75.49  68.52  81.02  93.85 
CO+reg  83.63  70.05  81.37  93.92 
Table 3. Node classification accuracy.
Algorithm  Cora  Citeseer  Pubmed  ogbn-arxiv  Wiki  Computers  Photo  ogbn-products  Reddit 

GCN  81.54  71.25  79.26  71.74  72.40  79.82  88.75  75.64  94.02 
node2vec  71.07  47.37  66.34  70.07  58.76  75.37  83.63  72.49  93.26 
DGI  81.90  71.85  76.89  69.66  63.70  64.92  77.19  77.00  94.14 
GMI  80.95  71.11  77.97  68.36  63.35  79.27  87.08  75.55  94.19 
ours (LC)  82.33  72.88  79.33  69.94  69.19  80.89  87.59  76.96  94.43 
ours (ML)  82.65  72.98  80.10  70.05  67.20  80.30  86.78  76.27  94.38 
Table 4. Graph clustering results.
Algorithm  Cora  Citeseer  Wiki  

Acc  NMI  F1  Acc  NMI  F1  Acc  NMI  F1  
node2vec  61.78  44.47  62.65  39.58  24.23  37.54  43.29  37.39  36.35 
DGI  71.81  54.90  69.88  68.60  43.75  64.64  44.37  42.20  40.16 
AGC  68.93  53.72  65.62  68.37  42.44  63.73  49.54  47.02  42.16 
GMI  63.44  50.33  62.21  63.75  38.14  60.23  42.81  41.53  38.52 
ours (LC)  70.04  55.08  67.36  67.90  43.63  64.21  50.12  49.70  43.74 
ours (ML)  71.59  56.01  68.11  69.17  44.47  64.74  53.13  51.81  46.11 
7. Experimental Results
In this section, we first show that ContrastReg can be generally used for various designs of contrastive pairs. Then we evaluate the performance of various models trained with ContrastReg in both graph representation learning and pretraining settings, where contrastive learning is successfully applied.
Datasets. The datasets we used include citation networks such as Cora, Citeseer, Pubmed (Yang et al., 2016) and ogbn-arxiv (Hu et al., 2020a), web graphs such as Wiki (Yang et al., 2015), co-purchase networks such as Computers, Photo (Shchur et al., 2018) and ogbn-products (Hu et al., 2020a), and social networks such as Reddit (Hamilton et al., 2017). Some statistics of the datasets are given in Table 1.
Models. We denote Algorithm 2 (local clustering) capturing structure similarity as ours (LC), Algorithm 3 (multi-level representations) capturing attribute similarity as ours (ML), and the algorithm (co-occurrence) capturing proximity similarity as ours (CO).
Unsupervised training procedure. We used full-batch training for Cora, Citeseer, Pubmed, ogbn-arxiv, Wiki, Computers and Photo, and stochastic mini-batch training for Reddit and ogbn-products. For Cora, Citeseer, Pubmed, ogbn-arxiv, ogbn-products and Reddit, we used the standard split provided with the datasets and fixed the random seeds from 0 to 9 for 10 different runs. For Computers, Photo and Wiki, we randomly split train/validation/test as 20 nodes/30 nodes per class/all the remaining nodes, as recommended in (Shchur et al., 2018). The performance was measured over 25 different runs, with 5 random splits and 5 fixed-seed runs (from 0 to 4) for each random split. For Wiki, we removed the edge attributes for all models for a fair comparison. The additional designs for the link prediction task and the pretraining setting are given in their respective subsections.
7.1. Generalizability of ContrastReg
To evaluate the performance gain from ContrastReg, we tested model performance (on node classification accuracy) with and without ContrastReg on four networks from four different domains. The GCN encoder was used on Cora, Wiki and Computers; a GraphSage encoder with GCN aggregation was used on Reddit. Table 2 shows that ContrastReg helps better capture different types of similarity, i.e., ML for attribute similarity, LC for structure similarity, and CO for proximity similarity, and improves the performance of the models in all cases. In the following experiments, we omit CO since its contrastive loss is computed from sampled edges and thus its computation cost is larger than that of the other two contrast designs.
7.2. Graph Representation Learning
Next we show that high-quality representations can be learned by our method. High quality means that, using these representations, simple models (e.g., a linear classifier for classification and k-means for clustering) can easily achieve high performance on various downstream tasks. To show this, we evaluated the learned representations on three downstream tasks that are fundamental for graph analytics: node classification, graph clustering, and link prediction.
7.2.1. Node Classification
We evaluated the performance of node classification on all datasets, using both full-batch training and stochastic mini-batch training. We compared our methods with DGI (Velickovic et al., 2019), GMI (Peng et al., 2020), node2vec (Grover and Leskovec, 2016), and supervised GCN (Kipf and Welling, 2017). DGI and GMI are state-of-the-art algorithms in unsupervised graph representation learning. Node2vec is a representative random-walk-based graph representation algorithm (Grover and Leskovec, 2016; Tang et al., 2015; Perozzi et al., 2014). GCN is a classic supervised GNN model. We report ours (LC) and ours (ML), both using ContrastReg, where the GNN encoder is GCN for full-batch training and GraphSage (Hamilton et al., 2017) with GCN aggregation for stochastic training, respectively. The encoder settings are the same as in DGI and GMI. Our framework can also adopt other encoders such as GAT, and similar performance improvements can be obtained; we omit the detailed results due to the page limit.
Table 3 reports the node classification accuracy with standard deviation. The results show that our algorithms achieve better performance in the majority of the cases, for both full-batch training (on Cora, Citeseer, Pubmed, Computers, Photo and Wiki) and stochastic training (on Reddit and ogbn-products). Our unsupervised algorithms can even outperform the supervised GCN. Compared with DGI and GMI, our model is similar to the DGI model with a properly designed contrastive pair and to the GMI model with the ContrastReg term added, and thus we can achieve better performance in most cases. Comparing Table 2 and Table 3 shows that the performance gain comes from ContrastReg rather than from a more proper contrast design.

7.2.2. Graph Clustering
Following the work of Xia et al. (2014), we used three metrics to evaluate the clustering performance: accuracy (Acc), normalized mutual information (NMI), and F1-macro (F1). For all three metrics, a higher value indicates better clustering performance. We compared our methods with DGI, node2vec, GMI, and AGC (Zhang et al., 2019) on Cora, Citeseer and Wiki. AGC is a state-of-the-art graph clustering algorithm, which exploits high-order graph convolution to perform attributed graph clustering. For all models and all datasets, we used k-means to cluster the node representations, and the node labels are taken as the ground truth. Since high dimensionality is harmful to clustering (Chen, 2009), we applied PCA to the representations to reduce the dimensionality before running k-means. The random seed setting for model training was the same as that in the node classification task. To reduce the randomness caused by k-means, we set the random seed of clustering from 0 to 4 and took the average result for each learned representation. For each cell in Table 4, we report the better of the results with and without PCA. The results show that our algorithms, especially ours (ML), achieve better performance in most cases, which again demonstrates the effectiveness of ContrastReg. Note that graph clustering is applied on attributed graphs; the fact that the results of ours (ML) are better than those of ours (LC) tells us that attributes play an important role in clustering.
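A sketch of this evaluation protocol with scikit-learn is given below; the PCA output dimension and the helper name are placeholders, and Acc/F1 would additionally require matching predicted clusters to labels (e.g., by the Hungarian algorithm).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score

def cluster_eval(reps, labels, n_clusters, pca_dim=32, seed=0):
    """Cluster learned representations with k-means and score against labels."""
    x = PCA(n_components=pca_dim).fit_transform(reps)     # reduce dimensionality first
    pred = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(x)
    return normalized_mutual_info_score(labels, pred)

# average NMI over several k-means seeds, following the protocol above:
# nmi = np.mean([cluster_eval(reps, labels, k, seed=s) for s in range(5)])
```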
Table 5. Link prediction results.
Algorithm  Cora  Citeseer  Pubmed  Wiki 

GCN-neg  92.40  92.27  97.24  93.27 
node2vec  86.33  79.60  81.74  92.41 
DGI  93.62  95.03  97.24  95.55 
GMI  91.31  92.23  95.14  95.30 
ours (LC)  94.61  95.63  97.26  96.28 
7.2.3. Link Prediction
The representations learned in Sections 7.2.1 and 7.2.2 should not be directly used in the link prediction task, because the encoder already has access to all edges in the input graph when we train it with contrastive learning, which leads to a data leakage issue (i.e., the edges used in the prediction task are accessible during training). Thus, for link prediction, an inductive setting of graph representation learning was adopted. We extracted random induced subgraphs (85% of the edges) from each original graph to train the representation learning model and the link predictor. The remaining edges were used to validate and test the link prediction results (10% of the edges as the test edge set, 5% as the validation edge set). The performance was evaluated over 25 (5x5) different runs, with 5 different induced subgraphs (fixed-seed random split scheme) and 5 fixed-seed runs (from 0 to 4). We compared our model with DGI, GMI, node2vec and unsupervised GCN (i.e., GCN-neg in Table 5) on Cora, Citeseer, Pubmed and Wiki. The results in Table 5 show that our algorithm achieves better performance than the state-of-the-art methods. We did not evaluate the ML model in this experiment because ML pays more attention to node attributes.
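A sketch of the edge split used in this inductive setting (85%/5%/10%) is given below; the array layout and the shuffling scheme are our assumptions.

```python
import numpy as np

def split_edges(edges, seed=0, val_frac=0.05, test_frac=0.10):
    """edges: [E, 2] array. Returns train/val/test edge sets for link prediction."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(edges))
    n_val, n_test = int(val_frac * len(edges)), int(test_frac * len(edges))
    val = edges[perm[:n_val]]
    test = edges[perm[n_val:n_val + n_test]]
    train = edges[perm[n_val + n_test:]]      # ~85% of edges induce the training graph
    return train, val, test
```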
Algorithm  ogbn-products  

No pretraining  90.44  84.69 
DGI  92.09  86.37 
GMI  92.13  86.14 
ours (ML)  92.18 