# Improving Graph Representation Learning by Contrastive Regularization

Graph representation learning is an important task with applications in various areas such as online social networks, e-commerce networks, the WWW, and semantic webs. For unsupervised graph representation learning, many algorithms such as Node2Vec and GraphSAGE make use of "negative sampling" and/or noise contrastive estimation loss. This shares similar ideas with contrastive learning, which "contrasts" the node representation similarities of semantically similar (positive) pairs against those of negative pairs. However, despite the success of contrastive learning, we find that directly applying this technique to graph representation learning models (e.g., graph convolutional networks) does not always work. We theoretically analyze the generalization performance and propose a lightweight regularization term that avoids high scales of the node representation norms and high variance among them, thereby improving the generalization performance. Our experimental results further validate that this regularization term significantly improves the representation quality across different node similarity definitions and outperforms the state-of-the-art methods.



## 1. Introduction

Graphs are widely used to capture rich information (e.g., hierarchical structures, communities) in data from various domains such as social networks, e-commerce networks, knowledge graphs, WWW, and semantic webs. By incorporating graph topology and node/edge features into machine learning models, graph representation learning has achieved great success in many important applications such as node classification, link prediction, and graph clustering.

A large number of graph representation learning algorithms (Velickovic et al., 2019; Peng et al., 2020; Tang et al., 2015; Perozzi et al., 2014; Grover and Leskovec, 2016; Ahmed et al., 2013; Cao et al., 2015; Qiu et al., 2018; Chen et al., 2018; Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019; Qu et al., 2019) have been proposed. Among them, many (Tang et al., 2015; Perozzi et al., 2014; Grover and Leskovec, 2016; Hamilton et al., 2017) are designed in an unsupervised manner and make use of “negative sampling” to learn node representations. This design shares similar ideas with contrastive learning (He et al., 2020; Tian et al., 2020, 2019; van den Oord et al., 2018; Hénaff et al., 2019; Belghazi et al., 2018; Hjelm et al., 2019; Wu et al., 2018; Mikolov et al., 2013; Asano et al., 2020; Caron et al., 2018; Chen et al., 2020), which “contrasts” the similarities of the representations of similar (or positive) node pairs against those of negative pairs. These algorithms adopt noise contrastive estimation loss (NCEloss), but differ in the definition of node similarity (hence the design of contrastive pairs) and in the encoder design.

Existing graph representation learning algorithms mainly fall into three categories: adjacency matrix factorization based models (Ahmed et al., 2013; Cao et al., 2015; Qiu et al., 2018), skip-gram based models (Perozzi et al., 2014; Grover and Leskovec, 2016; Chen et al., 2018), and graph neural networks (GNNs) (Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019; Qu et al., 2019). We focus on GNNs because GNN models not only capture graph topology information as the skip-gram and factorization based models do, but also incorporate node/edge features. Specifically, we formulate a contrastive GNN framework with four components: a similarity definition, a GNN encoder, a contrastive loss function, and possibly a downstream classification task.

While graph representation algorithms using the “negative sampling” approach are shown to achieve good performance empirically, there is a lack of theoretical analysis on the generalization performance. In addition, we also found that directly applying contrastive pairs and NCEloss to existing GNN models, e.g., graph convolutional networks (GCNs) (Kipf and Welling, 2017), does not always work well (as shown in Section 7). In order to understand the generalization performance of the algorithms and find out when the direct application of contrastive pairs and NCEloss does not work, we derive a generalization bound for our contrastive GNN framework using the theoretical framework proposed in (Saunshi et al., 2019). Our generalization bound reveals that the high scales of node representations’ norms and the high variance among them are two main factors that hurt the generalization performance.

To solve the problems caused by the two factors, we propose a novel regularization method, Contrast-Reg. Contrast-Reg uses a regularization vector $r$, a random vector whose elements are sampled uniformly at random. We learn a graph representation model by forcing the representations of all nodes to be similar to $Wr$ and the contrastive representations calculated by shuffling node features to be dissimilar to $Wr$, where $W$ is a trainable matrix. We show from the geometric perspective that Contrast-Reg stabilizes the scales of the representation norms and reduces their variance. We also validate by experiments that Contrast-Reg significantly improves the quality of node representations for a popular GNN model using different similarity definitions.

Outline. Section 2 discusses related work. Section 3 gives some preliminaries and Section 4 analyzes the generalization bound. Section 5 proposes Contrast-Reg and Section 6 presents the contrastive GNN framework. Section 7 reports the experimental results.

## 2. Related work

Graph representation learning. Many graph representation learning models have been proposed. Factorization based models (Ahmed et al., 2013; Cao et al., 2015; Qiu et al., 2018) factorize an adjacency matrix to obtain node representations. Random walk based models such as DeepWalk (Perozzi et al., 2014) sample node sequences as the input to skip-gram models to compute the representation for each node. Node2vec (Grover and Leskovec, 2016) balances depth-first and breadth-first random walk when it samples node sequences. HARP (Chen et al., 2018) compresses nodes into super-nodes to obtain a hierarchical graph to provide hierarchical information to random walk. GNN models (Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019; Qu et al., 2019) have shown great capability in capturing both graph topology and node/edge feature information. Most GNN models follow a neighborhood aggregation schema, in which each node receives and aggregates the information from its neighbors in each GNN layer, i.e., for the $l$-th layer, $a_v^{(l)}=\mathrm{AGGREGATE}^{(l)}\big(\{h_u^{(l-1)}:u\in\mathcal{N}(v)\}\big)$ and $h_v^{(l)}=\mathrm{COMBINE}^{(l)}\big(h_v^{(l-1)},a_v^{(l)}\big)$. This work proposes a regularization method, Contrast-Reg, for GNN models and improves their performance for downstream tasks.
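The neighborhood aggregation schema above can be sketched as follows. This is a minimal NumPy illustration (mean aggregator, ReLU combine), not any specific model from the cited papers; the weight matrices `W_self` and `W_neigh` are illustrative parameters.

```python
import numpy as np

def gnn_layer(h, adj, W_self, W_neigh):
    """One neighborhood-aggregation GNN layer (mean-aggregator sketch).

    h:        (N, d_in) node representations from the previous layer
    adj:      (N, N) binary adjacency matrix
    W_self:   (d_in, d_out) weight applied to each node's own representation
    W_neigh:  (d_in, d_out) weight applied to the aggregated neighbor message
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)  # avoid division by zero
    a = (adj @ h) / deg                 # AGGREGATE: mean of neighbor representations
    return np.maximum(h @ W_self + a @ W_neigh, 0)    # COMBINE, followed by ReLU
```

Stacking several such layers lets each node's representation incorporate information from its multi-hop neighborhood.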

Contrastive learning. Contrastive learning is a self-supervised learning method that learns representations by contrasting positive pairs against negative pairs. Contrastive pairs can be constructed in different ways for different types of data and tasks, such as multi-view (Tian et al., 2020, 2019), target-to-noise (van den Oord et al., 2018; Hénaff et al., 2019), mutual information (Belghazi et al., 2018; Hjelm et al., 2019), instance discrimination (Wu et al., 2018), context co-occurrence (Mikolov et al., 2013), clustering (Asano et al., 2020; Caron et al., 2018), and multiple data augmentation (Chen et al., 2020). In addition to the above unsupervised learning settings, (Tian et al., 2019; Chen et al., 2020; van den Oord et al., 2018) also show great success in capturing information that can be transferred to new tasks from different domains.

Contrastive learning has been successfully applied in many graph representation learning models such as (Perozzi et al., 2014; Grover and Leskovec, 2016; Tang et al., 2015; Velickovic et al., 2019; Peng et al., 2020; Hamilton et al., 2017). In this work, we apply contrastive learning in GNNs and propose a contrastive GNN framework. The state-of-the-art models such as (Velickovic et al., 2019; Peng et al., 2020), which both use the GCN encoder, can be seen as special instances of this algorithmic framework. We show that with the same GCN encoder and similar contrastive pair designs, our models can significantly outperform (Velickovic et al., 2019; Peng et al., 2020) by adopting Contrast-Reg.

Noise Contrastive Estimation loss. NCEloss was originally proposed to reduce the computation cost of estimating the parameters of a probabilistic model, using logistic regression to discriminate between the observed data and artificially generated noise (Gutmann and Hyvärinen, 2012; Dyer, 2014; Mnih and Teh, 2012). It has been successfully applied to contrastive learning (Tian et al., 2020, 2019; van den Oord et al., 2018; Hénaff et al., 2019; Belghazi et al., 2018; Hjelm et al., 2019; Wu et al., 2018; Chen et al., 2020). There are works aiming to explain the success of NCEloss. Gutmann and Hyvärinen (2012) proved that when NCEloss serves as the objective function of the parametric density estimation problem, the estimated parameters converge in probability to the optimal estimation. Yang et al. (2020b) showed that when NCEloss is applied in graph representation learning, the mean squared error of the similarity between two nodes is related to the negative sampling strategy. However, the definition of optimal representations or optimal parameters does not consider downstream tasks, but is based on pre-defined structures. In this paper, we adopt the theoretical settings from the contrastive learning framework proposed by Saunshi et al. (2019) to analyze the generalizability of NCEloss in downstream tasks, i.e., linear classification tasks. To the best of our knowledge, we are the first to theoretically analyze NCEloss under the contrastive learning setting of (Saunshi et al., 2019).

Node-level similarity. NCEloss has been adopted in many graph representation learning models to capture different types of node-level similarity. We characterize them as follows:

• Structural similarity: We may capture structural similarity from different angles. From the graph theory perspective, GraphWave (Donnat et al., 2018) leverages the diffusion of spectral graph wavelets to capture structural similarity, and struc2vec (Ribeiro et al., 2017) uses a hierarchy to measure node similarity at different scales. From the induced subgraph perspective, GCC (Qiu et al., 2020) treats the induced subgraphs of the same ego network as similar pairs and those from different ego networks as dissimilar pairs. To capture the community structure, vGraph (Sun et al., 2019) utilizes the high correlation of community detection and node representations to make node representations contain more community structure information. To capture the global-local structure, DGI (Velickovic et al., 2019) maximizes the mutual information between node representations and graph representations to allow node representations to contain more global information.

• Attribute similarity: Nodes with similar attributes are likely to have similar representations. GMI (Peng et al., 2020) maximizes the mutual information between node attributes and high-level representations, and Hu et al. (2020b) applies attribute masking to help capture domain-specific knowledge.

• Proximity similarity: Most random walk based models such as DeepWalk (Perozzi et al., 2014), Node2vec (Grover and Leskovec, 2016), and LINE (Tang et al., 2015) share an assumption that nodes with more proximity have higher probability to share the same label.

We will show that Contrast-Reg facilitates the contrastive training of graph representation learning models regardless of the different designs of contrastive pairs being used, and thus is helpful in capturing all types of similarities.

Regularization for graph representation learning. In addition to the general regularization terms used in machine learning such as L1/L2 regularization, there are regularizers proposed for graph representation learning models. GraphAT (Feng et al., 2019) and BVAT (Deng et al., 2019) add adversarial perturbation to the input data as a regularizer to obtain more robust models. GraphSGAN (Ding et al., 2018) generates fake input data in low-density regions by using a generative adversarial network as a regularizer. P-reg (Yang et al., 2020a) makes use of the smoothness property in real-world graphs to improve GNN models. Most of the above regularizers are designed for supervised tasks and perform well only in supervised settings, while Contrast-Reg is the first regularizer designed for contrastive learning and achieves excellent performance in unsupervised settings. We will also show its advantages over traditional regularizers, e.g., weight decay (L2 regularization of the model parameters) (Loshchilov and Hutter, 2019), in Section 5.

## 3. Preliminaries

We briefly discuss the theoretical framework (Saunshi et al., 2019) for contrastive learning and NCEloss, which are the foundation of Section 4.

### 3.1. Concepts in Contrastive Learning

Consider a feature space $\mathcal{X}$. The goal of contrastive learning is to train an encoder $f$ for all input data points by constructing positive pairs $(x,x^+)$ and negative pairs $(x,x^-)$. To formally analyze the behavior of contrastive learning, Saunshi et al. (2019) introduce the following concepts.

• Latent classes: Data are considered as drawn from latent classes $\mathcal{C}$ with distribution $\rho$. Further, a distribution $\mathcal{D}_c$ is defined over the feature space for each class $c$, measuring the relevance between a data point $x$ and the class $c$.

• Semantic similarity: Positive samples are drawn from the same latent class, with distribution

$$\mathcal{D}_{sim}(x,x^+)=\mathbb{E}_{c\sim\rho}\big[\mathcal{D}_c(x)\,\mathcal{D}_c(x^+)\big],\tag{1}$$

while negative samples are drawn randomly from all possible data points, i.e., the marginal distribution, as

$$\mathcal{D}_{neg}(x^-)=\mathbb{E}_{c\sim\rho}\big[\mathcal{D}_c(x^-)\big].\tag{2}$$
• Supervised tasks: Denote by $K$ the number of negative samples. The object of the supervised task, i.e., a feature-label pair $(x,c)$, is sampled from

$$\mathcal{D}_{\mathcal{T}}(x,c)=\mathcal{D}_c(x)\,\mathcal{D}_{\mathcal{T}}(c),$$

where the task $\mathcal{T}$ consists of $K+1$ distinct classes drawn from $\rho$, and $\mathcal{D}_{\mathcal{T}}(c)$ is the class distribution restricted to $\mathcal{T}$.

The mean classifier is naturally imposed to bridge the gap between representation learning performance and the linear separability of the learned representations:

$$W^{\mu}_c\coloneqq\mu_c=\mathbb{E}_{x\sim\mathcal{D}_c}[f(x)].$$
• Empirical Rademacher complexity: Suppose $f(x)\in\mathbb{R}^d$. Given a sample $S=\{x_1,\dots,x_M\}$,

$$\mathcal{R}_S(\mathcal{F})=\mathbb{E}_{\vec{\epsilon}}\Big[\sup_{f\in\mathcal{F}}\vec{\epsilon}^{\,T}f(S)\Big],$$

where $f(S)=\big(f_t(x_i)\big)_{t\in[d],\,i\in[M]}$ and the entries of $\vec{\epsilon}\in\{\pm 1\}^{dM}$ are independent random variables taking values uniformly from $\{\pm 1\}$.

In addition, the theoretical framework in (Saunshi et al., 2019) makes an assumption: the encoder $f$ is bounded, i.e., $\|f(x)\|\le R$ for all $x\in\mathcal{X}$.

### 3.2. Contrastive Learning with NCEloss

The contrastive loss defined by Saunshi et al. (2019) is

$$L_{un}(f)\coloneqq\mathbb{E}_{(x,x^+)\sim\mathcal{D}_{sim},\;(x_1^-,\dots,x_K^-)\sim\mathcal{D}_{neg}}\Big[\ell\big(\{f(x)^T(f(x^+)-f(x_i^-))\}_{i=1}^{K}\big)\Big],$$

where $\ell$ can be the hinge loss $\ell(v)=\max\big(0,\,1+\max_i\{-v_i\}\big)$ or the logistic loss $\ell(v)=\log_2\big(1+\sum_i\exp(-v_i)\big)$. Its supervised counterpart is defined as

$$L^{\mu}_{sup}(f)\coloneqq\mathbb{E}_{(x,c)\sim\mathcal{D}_{\mathcal{T}}}\Big[\ell\big(\{f(x)^T\mu_c-f(x)^T\mu_{c'}\}_{c'\neq c}\big)\Big].$$

A more powerful loss function, NCEloss, used in (Velickovic et al., 2019; Yang et al., 2020b; Mnih and Teh, 2012; Dyer, 2014), can be framed as

$$L_{nce}(f)\coloneqq-\mathbb{E}_{(x,x^+)\sim\mathcal{D}_{sim},\;(x_1^-,\dots,x_K^-)\sim\mathcal{D}_{neg}}\Big[\log\sigma\big(f(x)^Tf(x^+)\big)+\sum_{k=1}^{K}\log\sigma\big(-f(x)^Tf(x_k^-)\big)\Big],\tag{3}$$

and its empirical counterpart with $M$ samples is given as

$$\hat{L}_{nce}\coloneqq-\frac{1}{M}\sum_{i=1}^{M}\Big[\log\sigma\big(f(x_i)^Tf(x_i^+)\big)+\sum_{k=1}^{K}\log\sigma\big(-f(x_i)^Tf(x_{ik}^-)\big)\Big],\tag{4}$$

where $\sigma$ is the sigmoid function.

Its supervised counterpart is exactly the cross-entropy loss for the $(K{+}1)$-way multi-class classification task:

$$L^{\mu}_{sup}(f)\coloneqq-\mathbb{E}_{(x,c)\sim\mathcal{D}_{\mathcal{T}}}\Big[\log\sigma\big(f(x)^T\mu_c\big)+\log\sigma\big(-f(x)^T\mu_{c'}\big)\;\Big|\;c'\neq c\Big].\tag{5}$$
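The empirical NCEloss of Eq. (4) can be sketched in a few lines of NumPy. This is an illustrative implementation, not code from the paper; the array shapes are assumptions for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_loss(f_x, f_pos, f_neg):
    """Empirical NCEloss of Eq. (4).

    f_x:   (M, d) representations of the M seed samples
    f_pos: (M, d) representations of their positive samples
    f_neg: (M, K, d) representations of K negative samples per seed
    """
    pos = np.log(sigmoid(np.sum(f_x * f_pos, axis=1)))           # log sigma(f(x)^T f(x+))
    neg = np.log(sigmoid(-np.einsum("md,mkd->mk", f_x, f_neg)))  # log sigma(-f(x)^T f(x-))
    return -np.mean(pos + neg.sum(axis=1))
```

The loss decreases as positive pairs become more aligned and negative pairs less aligned, which is exactly the "contrasting" behavior the text describes.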

## 4. Theoretical Analyses on NCEloss

In this section, we first give the upper bound of supervised loss when training a model using NCEloss. Then we discuss the generalization bound of NCEloss along with the generalization bounds of hinge loss and logistic loss, and show their limitations.

### 4.1. The Generalization Bound of NCEloss

We give the generalization error of the function class $\mathcal{F}$ on the unsupervised loss function in Theorem 4.1. Since we focus on the role of regularization in contrastive learning, we give the result for a single negative sample, i.e., $K=1$.

Let $c^+,c^-$ be two classes sampled independently from the latent classes with distribution $\rho$. Let $\tau$ be the probability that $c^+$ and $c^-$ are the same class. Let $L^{=}_{nce}$ and $L^{\neq}_{nce}$ denote the NCEloss when the negative samples come from the same and a different class, respectively. We have the following theorem.

###### Theorem 4.1 ().

For all $f\in\mathcal{F}$, with probability at least $1-\delta$,

$$L^{\mu}_{sup}(\hat{f})\le L^{\neq}_{nce}(f)+\beta\,s(f)+\eta\,\mathrm{Gen}_M+\alpha,\tag{6}$$

where $\hat{f}$ is the empirical minimizer of $\hat{L}_{nce}$, $s(f)$ is a term determined by the scale of the representation norms (rewritten explicitly in Section 4.2.1), $\mathrm{Gen}_M$ is the generalization error over $M$ samples, and $\beta$, $\eta$, $\alpha$ are constants depending on $\tau$.

###### Remark 0 ().

The above theorem tells us that if a contrastive learning algorithm with NCEloss makes $\mathrm{Gen}_M$ converge to 0 as $M$ increases, the picked encoder $\hat{f}$ will have good performance in downstream tasks. In other words, we can guarantee that contrastive learning algorithms obtain high-quality representations by minimizing $L^{\neq}_{nce}(f)+\beta\,s(f)$, under the condition that $\mathrm{Gen}_M$ converges to 0 with a large amount of data.

To prove Theorem 4.1, we first list some key lemmas.

###### Lemma 4.2 ().

For all $f\in\mathcal{F}$,

$$L^{\mu}_{sup}(f)\le\frac{1}{1-\tau}\big(L_{nce}(f)-\tau\big).\tag{7}$$

This bound connects contrastive representation learning algorithms with their supervised counterparts. The lemma is proved using Jensen’s inequality; the details are given in Appendix A.1.

###### Lemma 4.3 ().

With probability at least $1-\delta$ over the training set $S$, for all $f\in\mathcal{F}$,

$$L_{nce}(\hat{f})\le L_{nce}(f)+\mathrm{Gen}_M.\tag{8}$$

This bound guarantees that the chosen $\hat{f}$ cannot be much worse than any $f\in\mathcal{F}$. The proof applies the Rademacher complexity of the function class (Mohri et al., 2018) and a vector-contraction inequality (Maurer, 2016). More details are given in Appendix A.2.

###### Lemma 4.4 ().

$L^{=}_{nce}(f)$ is bounded above by a term proportional to $s(f)$ plus a constant depending on $\tau$.

This bound accounts for the loss caused by positive and negative pairs that come from the same class, i.e., class collision. The proof uses Bernoulli’s inequality (details in Appendix A.3).

###### Proof to Theorem 4.1.

Combining Lemma 4.2 and Lemma 4.3, we obtain, with probability at least $1-\delta$ over the training set $S$, for all $f\in\mathcal{F}$,

$$L^{\mu}_{sup}(\hat{f})\le\frac{1}{1-\tau}\big(L_{nce}(f)+\mathrm{Gen}_M-\tau\big).\tag{9}$$

Then, we decompose $L_{nce}(f)=(1-\tau)L^{\neq}_{nce}(f)+\tau L^{=}_{nce}(f)$, apply Lemma 4.4 to Eq. (9), and obtain the result of Theorem 4.1. ∎

### 4.2. Discussion on the Generalization Bound

We now discuss the generalization error of NCEloss in the contrastive learning setting.

#### 4.2.1. Discussion on GenM and s(f)

$\mathrm{Gen}_M$ in Eq. (6) is the generalization error in terms of Rademacher complexity. It shows that when the encoder function is bounded and the number of samples $M$ is large enough, the $\hat{f}$ obtained by minimizing $\hat{L}_{nce}$ provides a performance guarantee. Note that when $\ell$ is a bounded Lipschitz continuous function and the encoder is bounded, the generalization errors of different contrastive loss terms differ only in the contraction rate, i.e., the Lipschitz constant.

$s(f)$ in Eq. (6) can be further rewritten as

$$s(f)=\sqrt[4]{\mathbb{E}_{(x_i,x_j)\sim\mathcal{D}_{sim}}\big[f(x_i)^Tf(x_j)\,f(x_j)^Tf(x_i)\big]}=\sqrt[4]{\mathbb{E}_{c\sim\rho}\Big[\mathbb{E}_{x_i\sim\mathcal{D}_c}\big[f(x_i)^T\,\mathbb{E}_{x_j\sim\mathcal{D}_c}[f(x_j)f(x_j)^T]\,f(x_i)\big]\Big]}\le\sqrt[4]{\mathbb{E}_{c\sim\rho}\Big[\max_{x\sim\mathcal{D}_c}\|f(x)\|^2\times\|M(f,c)\|_2\Big]},$$

where $M(f,c)=\mathbb{E}_{x\sim\mathcal{D}_c}[f(x)f(x)^T]$. It shows that NCEloss is prone to be disturbed by large representation norms.

#### 4.2.2. Cases that make contrastive learning suboptimal

There are two cases where contrastive learning algorithms cannot guarantee that the learned $\hat{f}$ works in downstream tasks, as pointed out in (Saunshi et al., 2019); both also apply to NCEloss. Case 1. The optimal $f$ for the downstream task can have a large unsupervised loss, and thus the algorithm fails, because of large spurious components in the representations that are orthogonal to the separation plane of the downstream task. Case 2. High intraclass deviation makes the class-collision loss $L^{=}_{nce}(f)$ large even if both of its supervised counterpart losses are small, resulting in failure of the algorithm.

There is an additional case for NCEloss (Case 3). The optimal $f$ for the downstream task can have representation norms with a large mean and a large variance, which lead to a large $s(f)$ and a large $L^{=}_{nce}(f)$, even if $f$ gives low intraclass deviation.

Example.  Figure 1 depicts an example with two encoders whose representations differ in norm scale and norm variance. In this example, the encoder with smaller and more stable norms has better linear separability in both Figure 1(a) and 1(b). When the norms are large (cases 1 and 3), the contrastive learning algorithm using NCEloss converges to pick the worse encoder; when the norm variance is large, the worse encoder is also chosen, since it attains a smaller value of the bound in Eq. (6) (case 3).

We remark that once we avoid the problems of case 3, the problems of cases 1 and 2 cannot be serious. For case 1, when both the norm mean and the norm variance are not large, the length of the orthogonal projection is comparable to that of the separation plane, which avoids destroying the contrasting ability. For case 2, a mild norm scale and variance avoid the large intra-class deviation caused by a large variance of the representation norms.

We further show that case 3 is not an artificial case, but exists in practice. We use the training status of a model trained with only one contrastive loss computed by structural similarity on the Cora dataset (Yang et al., 2016) (Section 6) to demonstrate the issues with a high mean and a high variance of the representation norms. Denote

$$\mu^+\coloneqq\mathbb{E}_{c\sim\rho}\big[\mathbb{E}_{(x_i,x_j)\sim\mathcal{D}_{sim}}[\,|f(x_i)^Tf(x_j)|\,]\big],\qquad \mu^-\coloneqq\mathbb{E}_{c_1\neq c_2,\;c_1,c_2\sim\rho}\big[\mathbb{E}_{x_i\sim\mathcal{D}_{c_1},\,x_j\sim\mathcal{D}_{c_2}}[\,|f(x_i)^Tf(x_j)|\,]\big].$$

Figure 2 shows the variance and the mean of the representation norms (left top), the ratio of $\mu^+$ to $\mu^-$ (right top), $\mu^+$ and $\mu^-$ (left bottom), and the testing accuracy (in a node classification task) during training for 300 epochs. The variance and the mean of the representation norms increase with the progression of epochs. This increases $\mu^+$ and $\mu^-$ significantly, while the ratio $\mu^+/\mu^-$ and the representation quality (indicated by the test accuracy) decrease. In the following sections, we use the ratio $\mu^+/\mu^-$ to measure the contrasting ability of models, i.e., the ability to contrast different classes.
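The ratio $\mu^+/\mu^-$ can be estimated on a labeled validation set by treating same-class pairs as a proxy for $\mathcal{D}_{sim}$. This diagnostic sketch is not from the paper; using class labels as the similarity proxy is an assumption of the example.

```python
import numpy as np

def contrast_ratio(reps, labels):
    """Estimate mu+/mu-: mean |inner product| over same-class pairs
    versus cross-class pairs (labels serve as a proxy for D_sim)."""
    sims = np.abs(reps @ reps.T)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)   # exclude trivial self-pairs
    mu_pos = sims[same].mean()      # same-class pairs
    mu_neg = sims[~same & ~np.eye(len(labels), dtype=bool)].mean()  # cross-class pairs
    return mu_pos / mu_neg
```

A ratio well above 1 indicates that the encoder still contrasts classes; the figure described above corresponds to this ratio shrinking as the norms blow up.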

## 5. Contrastive Regularization

The theoretical analysis in Section 4 shows that a good contrastive representation learning algorithm should satisfy the following conditions: (1) avoiding a large representation norm $\mathbb{E}[\|f(x)\|]$; (2) avoiding a large norm variance $\mathrm{Var}(\|f(x)\|)$; (3) preserving contrast.

We remark that the norm variance measures how far the norms of the node representations are from their average value. It is different from the intra-class variance of (Saunshi et al., 2019), which is the largest eigenvalue of the covariance matrix $\Sigma(f,c)=\mathbb{E}_{x\sim\mathcal{D}_c}\big[(f(x)-\mu_c)(f(x)-\mu_c)^T\big]$. This is also why case 2 differs from case 3 in the previous example.

In order to satisfy the above conditions, we propose a contrastive regularization term, Contrast-Reg:

$$L_{reg}=-\mathbb{E}_{x,\tilde{x}}\big[\log\sigma(f(x)^TWr)+\log\sigma(-f(\tilde{x})^TWr)\big],\tag{10}$$

where $r$ is a random vector sampled uniformly at random, $W$ is a trainable parameter matrix, and $\tilde{x}$ denotes the noisy features. Different data augmentation techniques such as (Chen et al., 2020; Velickovic et al., 2019) can be applied to generate the noisy features. In Section 6, we discuss how we calculate the noisy features in the GNN setting.
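Eq. (10) can be sketched directly in NumPy. This is an illustrative implementation under assumed shapes; in training, $r$ would be resampled at each step and $W$ updated by back-propagation together with the encoder.

```python
import numpy as np

def contrast_reg(f_x, f_noise, W, r):
    """Contrast-Reg of Eq. (10) (sketch).

    f_x:     (N, d) representations of the original nodes
    f_noise: (N, d) representations computed from corrupted (noisy) features
    W:       (d, d) trainable parameter matrix
    r:       (d,) random regularization vector, resampled each training step
    """
    s_pos = 1.0 / (1.0 + np.exp(-(f_x @ W @ r)))      # sigma(f(x)^T W r): pull toward Wr
    s_neg = 1.0 / (1.0 + np.exp(f_noise @ W @ r))     # sigma(-f(x~)^T W r): push away from Wr
    return -np.mean(np.log(s_pos) + np.log(s_neg))
```

The regularizer is simply added to the contrastive loss; minimizing it makes clean representations align with $Wr$ and noisy ones oppose it.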

We give the motivation for Contrast-Reg’s design as follows. Consider an artificial downstream task that learns a classifier to discriminate the representations of regular data and noisy data. $L_{reg}$ in Eq. (10) can be viewed as the classification loss and $W$ as the parameter of the bilinear classifier. The classifier prefers an encoder that makes the representations of intra-class data points more condensed and the inter-class representations more separable. We use the GCN model on the Cora dataset (Yang et al., 2016) as an example. Figure 2(a) and Figure 2(b) show the t-SNE visualization of the representations before and after the optimization on $L_{reg}$, respectively.

We can observe that the learned representations are closer to each other (i.e., the representations after optimization lie in a smaller range), while the separability among the representations is preserved (i.e., the points with the same label, shown in the same color, remain grouped together).

### 5.1. Theoretical Guarantees for Contrast-Reg

Before stating the theorem, we give the following lemma to show that the norm variance can be effectively reduced by adding Contrast-Reg when the representation norms are large.

###### Lemma 5.1 ().

For a random variable $X$ with $X\ge 1.5$, a constant $\tau>0$ in a suitable range, and a constant $c$, we have

$$\mathrm{Var}\left(\sqrt{\Big(X+\frac{\tau}{1+e^{X}}\Big)^2+c^2}\,\right)<\mathrm{Var}\left(\sqrt{X^2+c^2}\right).\tag{11}$$
###### Proof.

First, we consider

$$h(x)=\sqrt{\Big(x+\frac{\tau}{1+e^{x}}\Big)^2+c^2}-\sqrt{x^2+c^2},$$

which is strictly decreasing on the considered range $x\ge 1.5$. Thus, for $x>y\ge 1.5$,

$$\sqrt{\Big(x+\frac{\tau}{1+e^{x}}\Big)^2+c^2}-\sqrt{x^2+c^2}<\sqrt{\Big(y+\frac{\tau}{1+e^{y}}\Big)^2+c^2}-\sqrt{y^2+c^2},$$

and since $x\mapsto\sqrt{(x+\frac{\tau}{1+e^{x}})^2+c^2}$ is monotonically increasing, we get

$$0<\sqrt{\Big(x+\frac{\tau}{1+e^{x}}\Big)^2+c^2}-\sqrt{\Big(y+\frac{\tau}{1+e^{y}}\Big)^2+c^2}<\sqrt{x^2+c^2}-\sqrt{y^2+c^2}.$$

When $x<y$,

$$\sqrt{x^2+c^2}-\sqrt{y^2+c^2}<\sqrt{\Big(x+\frac{\tau}{1+e^{x}}\Big)^2+c^2}-\sqrt{\Big(y+\frac{\tau}{1+e^{y}}\Big)^2+c^2}<0.$$

Further, let $X$ and $Y$ be i.i.d. random variables with density $p$. Then

$$\begin{aligned}\mathrm{Var}\left(\sqrt{\Big(X+\frac{\tau}{1+e^{X}}\Big)^2+c^2}\,\right)&=\frac{1}{2}\,\mathbb{E}_{X,Y}\left[\left(\sqrt{\Big(X+\frac{\tau}{1+e^{X}}\Big)^2+c^2}-\sqrt{\Big(Y+\frac{\tau}{1+e^{Y}}\Big)^2+c^2}\right)^{2}\right]\\&=\frac{1}{2}\int\left(\sqrt{\Big(x+\frac{\tau}{1+e^{x}}\Big)^2+c^2}-\sqrt{\Big(y+\frac{\tau}{1+e^{y}}\Big)^2+c^2}\right)^{2}p(x)p(y)\,dx\,dy\\&<\frac{1}{2}\int\left(\sqrt{x^2+c^2}-\sqrt{y^2+c^2}\right)^{2}p(x)p(y)\,dx\,dy=\mathrm{Var}\left(\sqrt{X^2+c^2}\right).\qquad\qed\end{aligned}$$
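A quick Monte Carlo check of Lemma 5.1 under assumed parameters ($X\ge 1.5$, $\tau=1$, $c=1$, and a particular sampling distribution chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# A random variable bounded below by 1.5 (assumed distribution for the check).
X = 1.5 + rng.exponential(scale=1.5, size=200_000)
tau, c = 1.0, 1.0

# Variance with and without the Contrast-Reg-style correction term tau / (1 + e^X).
var_reg = np.var(np.sqrt((X + tau / (1.0 + np.exp(X))) ** 2 + c ** 2))
var_plain = np.var(np.sqrt(X ** 2 + c ** 2))
assert var_reg < var_plain  # the variance is reduced, as the lemma states
```

Intuitively, the correction $\tau/(1+e^x)$ is larger for smaller $x$, so it lifts the low end of the distribution more than the high end and compresses the spread.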

###### Theorem 5.2 ().

Minimizing Eq. (10) induces a decrease in $\mathrm{Var}(\|f(x)\|)$ when the representation norms are large (in particular, when $f(x)^TWr\ge 1.5$).

###### Proof.

We minimize $L_{reg}$ by gradient descent with learning rate $\beta$. For the first term of Eq. (10),

$$\frac{\partial}{\partial f(x)}L_{reg}=-\sigma\big(-f(x)^TWr\big)\,Wr,$$

$$f(x)\leftarrow f(x)+\beta\,\sigma\big(-f(x)^TWr\big)\,Wr.\tag{12}$$

Eq. (12) shows that in every optimization step, $f(x)$ extends by $\beta\,\sigma(-f(x)^TWr)\,\|Wr\|$ along $r_0\coloneqq Wr/\|Wr\|$. If we orthogonally decompose $f(x)$ along $r_0$ and its unit orthogonal hyperplane $\Pi(r_0)$, we have

$$\|f(x)\|=\sqrt{\big(f(x)^Tr_0\big)^2+\big(f(x)^T\Pi(r_0)\big)^2}.\tag{13}$$

The projection of $f(x)$ along $r_0$ is $f(x)^Tr_0$, while the projection of $f(x)$ plus the Contrast-Reg update along $r_0$ is $f(x)^Tr_0+\beta\,\sigma(-f(x)^TWr)\,\|Wr\|$. Note that $f(x)^TWr=(f(x)^Tr_0)\,\|Wr\|$.

Based on Lemma 5.1 and Eq. (13), when $\tau=\beta\|Wr\|^2$ lies in a suitable range and $f(x)^TWr\ge 1.5$, we have

$$\mathrm{Var}\big(\|f(x)_{reg}\|\big)<\mathrm{Var}\big(\|f(x)\|\big).\tag{14}$$

###### Remark 1 ().

$f(x)^TWr\ge 1.5$, which is the condition of Eq. (14), is not difficult to satisfy, since the magnitude of $Wr$ can be tuned. In practice, one setting fits all our experiments.

###### Remark 2 ().

The range of $\tau$ in Theorem 5.2 is not a tight bound for $\tau$ in Lemma 5.1. Since, when Eq. (10) converges, $f(x)^TWr$ is much larger than 1.5 for almost all samples empirically, we prove the case for $X\ge 1.5$.

### 5.2. Understanding the effects of Contrast-Reg

Theorem 5.2 shows that Contrast-Reg can reduce $\mathrm{Var}(\|f(x)\|)$ when the representation norms are large, which is proved from the geometric perspective. Figure 4 visualizes the geometric interpretation of Contrast-Reg. In one gradient descent step, $f(x)$ and $f(x)_{reg}$ are the representations before and after the gradient descent update of Contrast-Reg. For any data point pair $x_1$ and $x_2$, we decompose $f(x_1)$ and $f(x_2)$ along $r_0$ and its orthogonal direction. Minimizing Eq. (10) consequently extends the representation in each step along $r_0$ while preserving the length in the orthogonal direction. Comparing the projections of $f(x_1)$ and $f(x_2)$ along $r_0$ before and after the update, we conclude that $\mathrm{Var}(\|f(x)_{reg}\|)<\mathrm{Var}(\|f(x)\|)$.

Note that the mean and the variance of the representation norms are positively correlated: comparing a scaled representation $\alpha f(x)$ with $f(x)$ for $\alpha>1$, the former has both a higher norm mean and a higher norm variance. Theorem 5.2 shows that our Contrast-Reg can reduce $\mathrm{Var}(\|f(x)\|)$ when the norms are large, and it should also prefer a lower mean because the variance increases when the norms are scaled to larger values. Figure 5 shows that the mean and the variance are reduced when we apply Contrast-Reg, compared to using only one contrastive loss computed by structural similarity. Also, the representation quality is improved significantly.

We also discuss two questions regarding the effects of Contrast-Reg. Does Contrast-Reg degrade the contrastive learning algorithm to a trivial solution, i.e., all representations converging to one point, even to the origin? We highlight that Theorem 5.2 reduces $\mathrm{Var}(\|f(x)\|)$ rather than $\|f(x)\|$ itself, and Contrast-Reg does not force all representations into one point. From Figure 4, we know that adding Contrast-Reg only reduces the variance of the representations along the direction of $r_0$, while preserving the differences along the orthogonal direction. Therefore, Contrast-Reg not only reduces $\mathrm{Var}(\|f(x)\|)$, but also preserves the contrasting ability of the contrastive learning algorithm, as shown in Figure 2(b). Furthermore, as we randomize $r$ in each training step, the variance reduction on the representation norms is conducted along various directions. Thus, the representations will not share one dominant direction, and Contrast-Reg does not make the representations converge to one point.

Can other regularization/normalization methods, e.g., weight decay or final representation normalization, solve the norm problem? Other regularization and normalization techniques may help stabilize the mean of the representation norms and reduce their variance, but they cannot replace Contrast-Reg as Contrast-Reg leads to more stable changes in the representation norms and preserves high contrast between positive and negative samples.

We compare Contrast-Reg with weight decay and normalization and show their performance during a 300-epoch training process. Figure 6 shows the performance of weight decay and Contrast-Reg. From the left-top figure, we can see that Contrast-Reg gives smaller and more stable representation norms than weight decay. Specifically, although a large weight decay rate can be applied to obtain smaller representation norms, it leads to fluctuation. The fluctuation in the representation norms impairs the training process, as reflected by the fluctuation in the training losses (left-bottom figure). In addition, from the ratio $\mu^+/\mu^-$, we know that Contrast-Reg preserves better contrasting ability.

Chen et al. (2020) show that adding $\ell_2$ normalization (i.e., using cosine similarity rather than inner product) with a temperature parameter improves the representation quality empirically. Figure 7 compares Contrast-Reg with $\ell_2$ normalization. $\ell_2$ normalization gives a much smaller $\mu^+/\mu^-$ ratio than Contrast-Reg, meaning that the separation between contrastive pairs provided by $\ell_2$ normalization is not as clear as that provided by Contrast-Reg. This is because $\ell_2$ normalization not only minimizes the variance of the representation norms, but also reduces the differences among the representations, rendering smaller contrast among the representations of data points in different classes. Thus, $\ell_2$ normalization gives less improvement in the representation quality than Contrast-Reg.

## 6. A Contrastive GNN Framework

We present our contrastive GNN framework in Algorithm 1. Given a graph and node attributes, we train a GNN model $f$ for a given number of epochs. Node representations can be obtained through $f$ and used as the input to downstream tasks. We adopt NCEloss as the contrastive loss in our framework. In each training epoch, we first select a seed node set for computing NCEloss using a select function (Line 3). Then Line 4 invokes a sample function to construct a positive sample and a negative sample for each seed node; the function returns a set of 3-tuples consisting of the representations of the seed node, the positive sample, and the negative sample, respectively. After that, Line 5 computes the training loss by adding the NCEloss on these tuples and the regularization loss calculated by Contrast-Reg, and Line 6 updates $f$ by back-propagation.

As mentioned in Section 5, Contrast-Reg requires noisy features for contrastive regularization. In our contrastive GNN framework, we generate the noisy features by simply shuffling node features among nodes following the corruption function in (Velickovic et al., 2019).
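The corruption step is a one-line operation: shuffle the rows of the feature matrix so that each node keeps its graph position but receives another node's features. A minimal sketch, following the shuffling idea described above:

```python
import numpy as np

def corrupt_features(X, rng):
    """Generate noisy features by shuffling feature rows among nodes.

    X:   (N, F) node feature matrix
    rng: numpy random Generator
    Returns a matrix with the same rows in a random node order, so the
    graph topology no longer matches the features.
    """
    perm = rng.permutation(X.shape[0])  # random permutation of node indices
    return X[perm]
```

Feeding the corrupted features through the same GNN encoder yields the representations $f(\tilde{x})$ used in Eq. (10).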

For different node similarity definitions, different select functions are designed to choose seed nodes that bring good training effects, and different sample functions are designed to generate suitable contrastive pairs for the seed nodes. In the following, we demonstrate by example the designs of three contrastive GNN models for structure, attribute, and proximity similarity, respectively, which are also used in our experimental evaluation in Section 7.

### 6.1. Structure Similarity

We give an example model (LC) that captures the community structure inherent in graph data (Newman, 2006). As clustering is a common and effective method for detecting communities in a graph, we conduct clustering in the node representation space to capture community structures. LC borrows the design from (Huang et al., 2019), implementing local clustering in its pair-generation function and curriculum learning in its seed-selection function. We remark that other methods such as global clustering (Caron et al., 2018) and instance discrimination (Wu et al., 2018) can also be adapted into our contrastive GNN framework through different implementations of these two functions.

Algorithm 2 shows the implementation of the seed-selection and pair-generation functions in LC. For each seed node, the pair-generation function picks a positive node from among the nodes that have the highest similarity scores with it, and a negative node randomly sampled from the remaining nodes (Lines 2-4). The seed-selection function initially chooses the nodes with the smallest entropy, to avoid high randomness and uncertainty at the start of training. Then, at fixed epoch intervals, it gradually adds nodes with larger entropy into the contrastive loss as training progresses (Lines 11-13).
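A simplified sketch of the two components, assuming precomputed pairwise similarity scores `sim` and a per-node entropy dictionary `entropy` (both names are ours); the actual implementation operates on learned representations and cluster assignments:

```python
import random

def gen_pairs_lc(seed_nodes, sim, all_nodes, rng):
    """Positive = the node most similar to the seed; negative = a node
    sampled uniformly from the rest (Lines 2-4, simplified)."""
    triples = []
    for u in seed_nodes:
        pos = max((v for v in all_nodes if v != u), key=lambda v: sim[u][v])
        candidates = [v for v in all_nodes if v not in (u, pos)]
        triples.append((u, pos, rng.choice(candidates)))
    return triples

def select_seeds_curriculum(entropy, epoch, base, step):
    """Curriculum seeding: start from the `base` lowest-entropy nodes and
    admit `step` higher-entropy nodes per interval (Lines 11-13, simplified)."""
    k = min(len(entropy), base + step * epoch)
    return sorted(entropy, key=entropy.get)[:k]
```

As the epoch counter grows, the seed set expands from the most confident (lowest-entropy) nodes toward the full node set.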

### 6.2. Attribute Similarity

Models adopting attribute similarity assume that nodes with similar attributes should have similar representations, i.e., attribute information should be preserved in the representations. Hjelm et al. (2019); Peng et al. (2020) proposed contrastive pair designs that maximize the mutual information between low-level representations (the input features) and high-level representations (the learned representations). Algorithm 3 presents our model, ML, which adapts their multi-level representation design into our contrastive GNN framework.

In Algorithm 3, the seed-selection function selects all nodes in the graph as seeds. The pair-generation function uses each seed node itself as its positive node, but in the returned 3-tuple the representation of the node as the seed differs from its representation as the positive sample: the second element of the 3-tuple is the node's learned representation, while the first element is calculated by stacking an additional GNN layer upon it. For the negative node, the pair-generation function randomly samples another node in the graph for each seed node.
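A sketch of this pair construction; `extra_layer` stands in for the additional GNN layer, and operating on a plain list of representation vectors is our simplification:

```python
import random

def gen_pairs_ml(h, extra_layer, rng):
    """h[v] is node v's learned representation. Seed side: one extra
    (stacked) layer applied on top of h[v]; positive: h[v] itself;
    negative: the representation of a randomly sampled other node."""
    n = len(h)
    triples = []
    for v in range(n):
        others = [u for u in range(n) if u != v]
        triples.append((extra_layer(h[v]), h[v], h[rng.choice(others)]))
    return triples
```

The contrast is thus between a high-level view and a low-level view of the same node, against a low-level view of a different node.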

### 6.3. Proximity Similarity

The assumption behind proximity similarity is that nodes are expected to have similar representations when they have high proximity (i.e., they are near neighbors). To capture proximity information among nodes, we implement the seed-selection and pair-generation functions following the setting of unsupervised GraphSAGE (Hamilton et al., 2017). Adjacent nodes are selected as positive pairs, while negative pairs are sampled from non-adjacent nodes.
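A sketch of pair generation under proximity similarity, assuming an adjacency-set representation of the graph (the function name and data layout are ours):

```python
import random

def gen_pairs_proximity(adj, rng):
    """adj maps each node to the set of its neighbours. Positives are
    adjacent nodes; negatives are sampled from non-adjacent nodes."""
    nodes = sorted(adj)
    triples = []
    for u in nodes:
        non_adj = [v for v in nodes if v != u and v not in adj[u]]
        if adj[u] and non_adj:
            triples.append((u, rng.choice(sorted(adj[u])), rng.choice(non_adj)))
    return triples
```

In practice the positives are drawn from sampled edges (e.g., via random walks) rather than by enumerating all neighbours.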

## 7. Experimental Results

In this section, we first show that Contrast-Reg can be generally used for various designs of contrastive pairs. Then we evaluate the performance of various models trained with Contrast-Reg in both graph representation learning and pretraining settings, where contrastive learning is successfully applied.

Datasets. The datasets we used include citation networks such as Cora, Citeseer, Pubmed (Yang et al., 2016) and ogbn-arxiv (Hu et al., 2020a), web graphs such as Wiki (Yang et al., 2015), co-purchase networks such as Computers, Photo (Shchur et al., 2018) and ogbn-products (Hu et al., 2020a), and social networks such as Reddit (Hamilton et al., 2017). Some statistics of the datasets are given in Table 1.

Models. We denote Algorithm 2 (local clustering), capturing structure similarity, as ours (LC); Algorithm 3 (multi-level representations), capturing attribute similarity, as ours (ML); and the co-occurrence algorithm capturing proximity similarity as ours (CO).

Unsupervised training procedure. We used full batch training for Cora, Citeseer, Pubmed, ogbn-arxiv, Wiki, Computers and Photo, and stochastic mini-batch training for Reddit and ogbn-products. For Cora, Citeseer, Pubmed, ogbn-arxiv, ogbn-products and Reddit, we used the standard split provided with the datasets and fixed the random seeds from 0 to 9 for 10 different runs. For Computers, Photo and Wiki, we randomly split train/validation/test as 20 nodes/30 nodes/all remaining nodes per class, as recommended in (Shchur et al., 2018). The performance was measured over 25 runs: 5 random splits with 5 fixed-seed runs (seeds 0 to 4) each. For Wiki, we removed the edge attributes for all models for a fair comparison. The additional designs specific to the link prediction task and the pretraining setting are given in their respective subsections.

### 7.1. Generalizability of Contrast-Reg

To evaluate the performance gain brought by Contrast-Reg, we tested model performance (node classification accuracy) with and without Contrast-Reg on four networks from four different domains. A GCN encoder was used on Cora, Wiki and Computers, and a GraphSAGE encoder with GCN-aggregation was used on Reddit. Table 2 shows that Contrast-Reg helps better capture different types of similarity, i.e., ML for attribute similarity, LC for structure similarity, and CO for proximity similarity, and improves the performance of the models in all cases. In the following experiments, we omit CO since its contrastive loss is computed over sampled edges and thus its computation cost is higher than that of the other two contrast designs.

### 7.2. Graph Representation Learning

Next we show that high quality representations can be learned by our method. High quality means that, using these representations, simple models (e.g., a linear classifier for classification and k-means for clustering) can easily achieve high performance on various downstream tasks. To show this, we evaluated the learned representations on three downstream tasks that are fundamental for graph analytics: node classification, graph clustering, and link prediction.

#### 7.2.1. Node Classification

We evaluated the performance of node classification on all datasets, using both full batch training and stochastic mini-batch training. We compared our methods with DGI (Velickovic et al., 2019), GMI (Peng et al., 2020), node2vec (Grover and Leskovec, 2016), and supervised GCN (Kipf and Welling, 2017). DGI and GMI are state-of-the-art algorithms in unsupervised graph representation learning. Node2vec is a representative random-walk-based graph representation algorithm (Grover and Leskovec, 2016; Tang et al., 2015; Perozzi et al., 2014). GCN is a classic supervised GNN model. We report ours (LC) and ours (ML), both using Contrast-Reg, where the GNN encoder is GCN for full batch training and GraphSAGE (Hamilton et al., 2017) with GCN-aggregation for stochastic training. The encoder settings are the same as in DGI and GMI. Our framework can also adopt other encoders such as GAT, and similar performance improvements can be obtained; we omit the detailed results due to the page limit.

Table 3 reports node classification accuracy with standard deviation. The results show that our algorithms achieve better performance in the majority of cases, for both full batch training (on Cora, Citeseer, Pubmed, Computers, Photo and Wiki) and stochastic training (on Reddit and ogbn-products). Our unsupervised algorithms can even outperform the supervised GCN. Our model can be viewed as the DGI model with a properly designed contrast pair, and as the GMI model with the Contrast-Reg term added, which explains why we achieve better performance than both in most cases. Comparing Table 2 and Table 3 further shows that the performance gain comes from Contrast-Reg rather than from a more proper contrast design.

#### 7.2.2. Graph Clustering

Following the work of Xia et al. (2014), we used three metrics to evaluate clustering performance: accuracy (Acc), normalized mutual information (NMI), and F1-macro (F1). For all three metrics, a higher value indicates better clustering performance. We compared our methods with DGI, node2vec, GMI, and AGC (Zhang et al., 2019) on Cora, Citeseer and Wiki. AGC is a state-of-the-art graph clustering algorithm that exploits high-order graph convolution for attributed graph clustering. For all models and all datasets, we used k-means to cluster both the labels and the representations of nodes, taking the clustering results on labels as the ground truth. Since high dimensionality is harmful to clustering (Chen, 2009), we applied PCA to the representations to reduce their dimensionality before running k-means. The random seed setting for model training was the same as in the node classification task, and to reduce the randomness of k-means, we set the clustering random seed from 0 to 4 and took the average result for each learned representation. For each cell in Table 4, we report the better of the results with and without PCA. The results show that our algorithms, especially ours (ML), achieve better performance in most cases, which again demonstrates the effectiveness of Contrast-Reg. Note that since graph clustering is applied on attributed graphs, the fact that ours (ML) outperforms ours (LC) indicates that attributes play an important role in clustering.
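The clustering evaluation can be illustrated with a minimal Lloyd's k-means and a permutation-based clustering accuracy (a brute-force stand-in for the usual Hungarian matching, feasible only for small k). All function names are ours, and the actual pipeline additionally applies PCA to the representations first:

```python
import itertools
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's k-means over lists of floats; returns a cluster
    assignment index for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = []
    for _ in range(iters):
        # assignment step: nearest centre by squared Euclidean distance
        assign = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centers[c])))
                  for pt in points]
        # update step: recompute each centre as the mean of its members
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def clustering_accuracy(pred, truth, k):
    """Best accuracy over all k! relabelings of the predicted clusters."""
    return max(sum(perm[p] == t for p, t in zip(pred, truth)) / len(truth)
               for perm in itertools.permutations(range(k)))
```

The relabeling step is needed because cluster indices produced by k-means are arbitrary and must be matched to the ground-truth classes before computing Acc.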