GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training

06/17/2020 ∙ by Jiezhong Qiu, et al. ∙ Microsoft ∙ Tsinghua University

Graph representation learning has emerged as a powerful technique for real-world problems. Various downstream graph learning tasks have benefited from its recent developments, such as node classification, similarity search, graph classification, and link prediction. However, prior work on graph representation learning focuses on domain-specific problems and trains a dedicated model for each graph, which is usually non-transferable to out-of-domain data. Inspired by recent advances in pre-training from natural language processing and computer vision, we design Graph Contrastive Coding (GCC), an unsupervised graph representation learning framework, to capture the universal network topological properties across multiple networks. We design GCC's pre-training task as subgraph-level instance discrimination in and across networks and leverage contrastive learning to empower the model to learn the intrinsic and transferable structural representations. We conduct extensive experiments on three graph learning tasks and ten graph datasets. The results show that GCC pre-trained on a collection of diverse datasets can achieve performance competitive with or better than that of its task-specific, trained-from-scratch counterparts. This suggests that the pre-training and fine-tuning paradigm presents great potential for graph representation learning.




1. Introduction

Hypothesis. Representative graph structural patterns are universal and transferable across networks.

Over the past two decades, the main focus of network science research has been on discovering and abstracting the universal structural properties underlying different networks. For example, Barabási and Albert show that several types of networks, e.g., World Wide Web, social, and biological networks, have the scale-free property, i.e., all of their degree distributions follow a power law (Albert and Barabási, 2002). Leskovec et al. (2005) discover that a wide range of real graphs satisfy the densification and shrinking laws. Other common patterns across networks include small world (Watts and Strogatz, 1998), motif distribution (Milo et al., 2004), community organization (Newman, 2006), and core-periphery structure (Borgatti and Everett, 2000), supporting our hypothesis at the conceptual level.

In the past few years, however, the paradigm of graph learning has been shifted from structural pattern discovery to graph representation learning (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016; Dong et al., 2017; Kipf and Welling, 2017; Qiu et al., 2018a; Xu et al., 2019; Qiu et al., 2019), motivated by recent advances in deep learning (Mikolov et al., 2013; Battaglia et al., 2018). Specifically, graph representation learning converts the vertices, edges or subgraphs of a graph into low-dimensional embeddings such that vital structural information of the graph is preserved. The learned embeddings from the input graph can be then fed into standard machine learning models for downstream tasks on the same graph.

However, most representation learning work on graphs has thus far focused on learning representations for one single graph or a fixed set of graphs, and very limited work can be transferred to out-of-domain data and tasks. Essentially, those representation learning models aim to learn network-specific structural patterns dedicated to each dataset. For example, the DeepWalk embedding model (Perozzi et al., 2014) learned on the Facebook social graph cannot be applied to other graphs. In view of (1) this limitation of graph representation learning and (2) the prior work on common structural pattern discovery, a natural question arises here: can we universally learn transferable representative graph embeddings from networks?

A similar question has also been asked and pursued in natural language processing (Devlin et al., 2019), computer vision (He et al., 2020), and other domains. To date, the most powerful solution is to pre-train a representation learning model from a large dataset, that is, self-supervised representation learning. The idea of pre-training is to use the pre-trained model as a good initialization for fine-tuning over (different) tasks on unseen datasets. For example, BERT (Devlin et al., 2019) designs language model pre-training tasks to learn a Transformer encoder (Vaswani et al., 2017) from a large corpus. The pre-trained Transformer encoder is then adapted to various NLP tasks (Wang et al., 2019a) by fine-tuning.

Presented Work. Inspired by this and the existence of universal graph structural patterns, we propose to study the potential of pre-training representation learning models for graphs. Ideally, given a (diverse) set of input graphs, such as the Facebook social graph and the DBLP co-author graph, we aim to pre-train a representation learning model from them with a self-supervised task, and then fine-tune it on different graphs with different graph learning tasks, such as node classification on the US-Airport graph. The critical question for graph pre-training here is: how to design the pre-training task such that the universal structural patterns in and across networks can be captured and further transferred?

In this work, we present the Graph Contrastive Coding (GCC) model to learn structural representations across graphs. Conceptually, we leverage the idea of contrastive learning (Wu et al., 2018) to design GCC’s pre-training task as instance discrimination. Its basic idea is to sample instances from input graphs, treat each of them as a distinct class of its own, and learn to encode and discriminate between these instances. Specifically, there are three questions to answer for GCC such that it can learn the transferable structural patterns: (1) what are the instances? (2) what are the discrimination rules? and (3) how to encode the instances?

In GCC’s pre-training stage, we propose to distinguish vertices according to their local structures (cf. Figure 1). For each vertex, we sample subgraphs from its multi-hop ego network as instances. GCC aims to distinguish between subgraphs sampled from a certain vertex and subgraphs sampled from other vertices. Finally, for each subgraph, we use a graph neural network (specifically, the GIN model (Xu et al., 2019)) as the graph encoder to map the underlying structural patterns to latent representations. As GCC does not assume vertices and subgraphs come from the same graph, the graph encoder is forced to capture universal patterns across different input graphs. Given the pre-trained GCC model, we apply it to unseen graphs for various downstream tasks. GCC is able to measure the structural similarity between two vertices from different domains, e.g., a user from the Facebook friendship network and a researcher from the DBLP co-authorship network, as shown in Figure 1.

To the best of our knowledge, very limited work exists in the field of structural graph representation pre-training to date. A very recent one is to design strategies for pre-training GNNs on labeled graphs with node attributes for specific domains (molecular graphs) (Hu et al., 2019a). Another recent work is InfoGraph (Sun et al., 2019), which focuses on learning domain-specific graph-level representations, especially for graph classification tasks. The third related work is by Hu et al. (2019b), who define several graph learning tasks, such as predicting centrality scores, to pre-train a GCN (Kipf and Welling, 2017) model on synthetic graphs.

We conduct extensive experiments to demonstrate the performance and transferability of GCC. We pre-train the GCC model on a collection of diverse types of graphs and apply the pre-trained model to three downstream graph learning tasks on ten new graph datasets. The results suggest that the GCC model achieves results competitive with or better than those of the state-of-the-art task-specific graph representation learning models that are trained from scratch. For example, for node classification on the US-Airport network, GCC pre-trained on the Facebook, IMDB, and DBLP graphs outperforms GraphWave (Donnat et al., 2018), ProNE (Zhang et al., 2019b) and Struc2vec (Ribeiro et al., 2017), which are trained directly on the US-Airport graph, empirically supporting the hypothesis stated at the beginning.

To summarize, our work makes the following four contributions:


  • We formalize the problem of structural graph representation pre-training across multiple graphs and identify its design challenges.

  • We design the pre-training task as subgraph-level instance discrimination to capture universal and transferable structural patterns from multiple input graphs.

  • We present the Graph Contrastive Coding (GCC) framework to learn structural graph representations, which leverages contrastive learning to guide the pre-training.

  • We conduct extensive experiments to demonstrate that for out-of-domain tasks, GCC can offer performance comparable or superior to that of dedicated graph-specific models.

Figure 1. An illustrative example of GCC. In this example, GCC aims to measure the structural similarity between a user from the Facebook friendship network and a researcher from the DBLP co-authorship network.

2. Related Work

In this section, we review related work on vertex similarity, contrastive learning, and graph pre-training.

2.1. Vertex Similarity

Quantifying similarity of vertices in networks/graphs has been extensively studied in the past years. The goal of vertex similarity is to answer questions (Leicht et al., 2006) like “How similar are these two vertices?” or “Which other vertices are most similar to these vertices?” The definition of similarity can be different in different situations. We briefly review the following three types of vertex similarity.

Neighborhood similarity. The basic assumption of neighborhood similarity, a.k.a. proximity, is that vertices closely connected should be considered similar. Early neighborhood similarity measures include Jaccard similarity (counting common neighbors), RWR similarity (Pan et al., 2004), and SimRank (Jeh and Widom, 2002). Most recently developed network embedding algorithms, such as LINE (Tang et al., 2015), DeepWalk (Perozzi et al., 2014), and node2vec (Grover and Leskovec, 2016), also follow the neighborhood similarity assumption.

Structural similarity. Different from neighborhood similarity, which measures similarity by connectivity, structural similarity doesn’t even assume vertices are connected. The basic assumption of structural similarity is that vertices with similar local structures should be considered similar. There are two lines of research about modeling structural similarity. The first line defines representative patterns based on domain knowledge. Examples include vertex degree, structural diversity (Ugander et al., 2012), structural hole (Burt, 2009), k-core (Alvarez-Hamelin et al., 2006), and motif (Milo et al., 2002; Benson et al., 2016). Consequently, models of this genre, such as Struc2vec (Ribeiro et al., 2017) and RolX (Henderson et al., 2012), usually involve explicit featurization. The second line of research leverages spectral graph theory to model structural similarity. A recent example is GraphWave (Donnat et al., 2018). In this work, we focus on structural similarity. Unlike the above two genres, we adopt contrastive learning and graph neural networks to learn structural similarity from data.

Attribute similarity. Real-world graph data often come with rich attributes, such as text in citation networks, demographic information in social networks, and chemical features in molecular graphs. Recent graph neural network models, such as GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017; Ying et al., 2018) and MPNN (Gilmer et al., 2017), leverage additional attributes as side information or supervised signals to learn representations which are further used to measure vertex similarity.

2.2. Contrastive Learning

Contrastive learning is a natural choice to capture similarity from data. In natural language processing, the Word2vec model (Mikolov et al., 2013) uses co-occurring words and negative sampling to learn word embeddings. In computer vision, a large collection of work (Hadsell et al., 2006; Wu et al., 2018; He et al., 2020; Tian et al., 2019) learns self-supervised image representations by minimizing the distance between two views of the same image. In this work, we adopt the InfoNCE loss from Oord et al. (2018) and the instance discrimination task from Wu et al. (2018), as discussed in Section 3.

2.3. Graph Pre-training

Skip-gram based models. Early attempts to pre-train graph representations are skip-gram based network embedding models inspired by Word2vec (Mikolov et al., 2013), such as LINE (Tang et al., 2015), DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016), and metapath2vec (Dong et al., 2017). Most of them follow the neighborhood similarity assumption, as discussed in Section 2.1. The representations learned by the above methods are tied to the graphs used to train the models, and cannot handle out-of-sample problems. Our Graph Contrastive Coding (GCC) differs from these methods in two aspects. First, GCC focuses on structural similarity, which is orthogonal to neighborhood similarity. Second, GCC can be transferred across graphs, even to graphs never seen during pre-training.

Pre-training graph neural networks. There are several recent efforts to bring ideas from language pre-training (Devlin et al., 2019) to pre-training graph neural networks (GNNs). For example, Hu et al. (2019a) pre-train GNNs on labeled graphs, especially molecular graphs, where each vertex (atom) has an atom type (such as C, N, O), and each edge (chemical bond) has a bond type (such as the single bond and double bond). The pre-training task is to recover atom types and chemical bond types in masked molecular graphs. Another related work is by Hu et al. (2019b), which defines several graph learning tasks to pre-train a GCN (Kipf and Welling, 2017). Our GCC framework differs from the above methods in two aspects. First, GCC is for general unlabeled graphs, especially social and information networks. Second, GCC does not involve explicit featurization and pre-defined graph learning tasks.

3. Graph Representation Pre-training

Figure 2. A Running Example of GCC. Left: examples of two 2-ego networks with the red and blue vertices as the egos. Middle: a similar pair $(x^q, x^{k_0})$ is randomly augmented from the red ego network, and two negative subgraphs, $x^{k_1}$ and $x^{k_2}$, are randomly augmented from another ego network, the blue one. Right: these subgraph instances are encoded by graph neural networks $f_q$ and $f_k$, after which the contrastive loss in Equation 1 encourages a higher similarity score between positive pairs than negative ones. Note that the two ego networks are not required to come from the same graph.

In this section, we formalize the structural graph representation pre-training problem. To address it, we present the Graph Contrastive Coding (GCC) framework.

3.1. The Graph Pre-Training Problem

Conceptually, given a collection of graphs from various domains, we aim to pre-train a model to capture structural patterns across these graphs in an unsupervised manner. The model should be able to benefit various graph learning tasks on different graphs. The underlying assumption is that there exist common and transferable structural patterns across different graphs such as the motif, k-core, and structural hole, as evident in network science literature (Milo et al., 2002; Burt, 2009; Ugander et al., 2012; Benson et al., 2016). One illustrative scenario is that we pre-train a model on Facebook, IMDB, and DBLP graphs with self-supervision, and apply it on the US-Airport network for node classification.

Formally, the structural graph representation pre-training problem is to learn a function $f$ that maps a vertex to a low-dimensional feature vector, such that $f$ has the following two properties:


  • First, structural similarity, it maps vertices with similar local network topologies close to each other in the vector space;

  • Second, transferability, it is compatible with vertices and graphs unseen during pre-training.

As such, the embedding function can be adopted in various graph learning tasks, such as social role prediction, node classification, and graph classification.

Note that the focus of this work is on structural representation learning without node attributes and node labels, making it completely different from the problem setting in graph neural network research. In addition, the goal is to pre-train a structural representation model and apply it to unseen graphs, differing from traditional network embedding (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016) and recent attempts on pre-training graph neural networks with attributed graphs as input and applying them within a specific domain (Hu et al., 2019a).

3.2. Graph Contrastive Coding Overview

To pre-train structural representations on graphs, we present the Graph Contrastive Coding (GCC) model. Given a set of graphs, our goal is to pre-train a universal graph neural network encoder to capture the structural patterns behind these graphs. To achieve this, we need to design proper self-supervised tasks and learning objectives for graph structured data. Inspired by the recent success of contrastive learning in CV (Wu et al., 2018; He et al., 2020) and NLP (Mikolov et al., 2013; Clark et al., 2019), we propose to use instance discrimination (Wu et al., 2018) as our pre-training task, and InfoNCE (Oord et al., 2018) as our learning objective. Our choice of pre-training task and learning objective treats each instance as a distinct class of its own and learns to discriminate between these instances (Wu et al., 2018; He et al., 2020). The promise is that it can output instance representations that are capable of capturing the similarities between instances. From a dictionary look-up perspective, given an encoded query $q$ and a dictionary of $K+1$ encoded keys $\{k_0, \dots, k_K\}$, contrastive learning looks up a single key (denoted by $k_+$) that $q$ matches in the dictionary. In this work, we adopt InfoNCE (Oord et al., 2018) such that:

$$\mathcal{L} = -\log \frac{\exp(q^\top k_+ / \tau)}{\sum_{i=0}^{K} \exp(q^\top k_i / \tau)} \qquad (1)$$

where $\tau$ is the temperature hyper-parameter. $f_q$ and $f_k$ are two graph neural networks that encode the query instance $x^q$ and each key instance $x^k$ to $d$-dimensional representations, denoted by $q = f_q(x^q)$ and $k = f_k(x^k)$.
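As an illustration of the InfoNCE objective described in this subsection, the following minimal sketch computes the loss for a single query as the negative log-softmax of the matching key's score. The function name `info_nce_loss`, the list-based vectors, and the temperature used in the usage lines are assumptions made for this example; they are not taken from the GCC implementation.

```python
import math


def info_nce_loss(q, keys, positive_index, tau):
    """InfoNCE loss for one encoded query against a dictionary of keys.

    q and every entry of keys are equal-length lists of floats (assumed to be
    L2-normalized already); positive_index marks the matching key k_+.
    """
    # Similarity score (dot product) of the query with each key, scaled by tau.
    logits = [sum(a * b for a, b in zip(q, k)) / tau for k in keys]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_denominator - logits[positive_index]


# Hypothetical toy data: the query matches the first key exactly.
query = [1.0, 0.0]
dictionary = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_matched = info_nce_loss(query, dictionary, positive_index=0, tau=0.1)
loss_mismatched = info_nce_loss(query, dictionary, positive_index=1, tau=0.1)
```

The loss is small when the positive key is the one most similar to the query and grows when a dissimilar key is marked as the positive, which is the behavior the pre-training task relies on.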

To instantiate each component in GCC, we need to answer the following three questions:


  • Q1: How to define instances in graphs?

  • Q2: How to define (dis)similar instance pairs in and across graphs, i.e., for a query $x^q$, which key $x^k$ is the matched one?

  • Q3: What are the proper graph encoders $f_q$ and $f_k$?

It is worth noting that in our problem setting, $x^q$ and the $x^k$'s are not assumed to be from the same graph.

3.3. GCC Design

In this part, we present the design strategies for the GCC model by correspondingly answering the aforementioned questions.

Q1: Design contrastive instances in graphs. The success of a contrastive learning framework largely relies on the definition of the data instance. It is straightforward for CV and NLP tasks to define an instance as an image or a sentence. However, such ideas cannot be directly extended to graph data, as instances in graphs are not clearly defined. Moreover, our pre-training focus is purely on structural representations without additional input features/attributes. This makes the natural choice of a single vertex as an instance infeasible, since a single vertex without features gives the model nothing with which to discriminate between two vertices. To address this issue, we propose to extend one single vertex to its local structure. Specifically, for a certain vertex $v$, we define an instance to be its $r$-ego network:

Definition 3.1 (An $r$-ego network). Let $G = (V, E)$ be a graph, where $V$ denotes the set of vertices and $E \subseteq V \times V$ denotes the set of edges (in this work, we consider undirected edges). For a vertex $v$, its $r$-neighbors are defined as $S_v = \{u : d(u, v) \le r\}$, where $d(u, v)$ is the shortest path distance between $u$ and $v$ in the graph $G$. The $r$-ego network of vertex $v$, denoted by $G_v$, is the sub-graph induced by $S_v$.
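The ego-network definition above can be illustrated with a breadth-first search that stops at the chosen radius and then induces the subgraph on the visited vertices. This is a minimal sketch; the adjacency-dictionary representation and the name `r_ego_network` are assumptions made for the example.

```python
from collections import deque


def r_ego_network(adj, v, r):
    """Return (vertices, edges) of the ego network of v with radius r.

    adj is an undirected graph given as a dict: vertex -> set of neighbours.
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # vertices at distance r are kept but not expanded
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    vertices = set(dist)
    # Induced subgraph: keep every edge whose two endpoints were both reached.
    edges = {frozenset((a, b)) for a in vertices for b in adj[a] if b in vertices}
    return vertices, edges


# Hypothetical toy graph: the path 0-1-2-3-4 with an extra edge 1-5.
adj = {0: {1}, 1: {0, 2, 5}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 5: {1}}
vertices, edges = r_ego_network(adj, 0, 2)
```

With radius 2 around vertex 0, the search reaches vertices 1, 2, and 5 but not 3 or 4, so only the edges among the reached vertices are kept.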

The left panel of Figure 2 shows two examples of 2-ego networks. GCC treats each $r$-ego network as a distinct class of its own and encourages the model to distinguish similar instances from dissimilar instances. Next, we introduce how to define (dis)similar instances.

Q2: Define (dis)similar instances. In computer vision (He et al., 2020), two random data augmentations (e.g., random crop, random resize, random color jittering, random flip, etc.) of the same image are treated as a similar instance pair. In GCC, we consider two random data augmentations of the same $r$-ego network as a similar instance pair and define the data augmentation as graph sampling (Leskovec and Faloutsos, 2006). Graph sampling is a technique to derive representative subgraph samples from the original graph. Suppose we would like to augment vertex $v$’s $r$-ego network ($G_v$); the graph sampling for GCC follows three steps: random walks with restart (RWR) (Tong et al., 2006), subgraph induction, and anonymization (Micali and Zhu, 2016; Jin et al., 2019).

  1. Random walk with restart. We start a random walk on $G$ from the ego vertex $v$. The walk iteratively travels to its neighborhood with the probability proportional to the edge weight. In addition, at each step, with a positive probability the walk returns to the starting vertex $v$.

  2. Subgraph induction. The random walk with restart collects a subset of vertices surrounding $v$, denoted by $\tilde{S}_v$. The sub-graph $\tilde{G}_v$ induced by $\tilde{S}_v$ is then regarded as an augmented version of the $r$-ego network $G_v$. This step is also known as the Induced Subgraph Random Walk Sampling (ISRW).

  3. Anonymization. We then anonymize the sampled graph $\tilde{G}_v$ by re-labeling its vertices to be $\{1, 2, \dots, |\tilde{S}_v|\}$, in arbitrary order. (Vertex order doesn’t matter because most graph neural networks are invariant to permutations of their inputs (Battaglia et al., 2018).)

We repeat the aforementioned procedure twice to create two data augmentations, which form a similar instance pair $(x^q, x^{k_+})$. If two subgraphs are augmented from different $r$-ego networks, we treat them as a dissimilar instance pair $(x^q, x^k)$ with $k \neq k_+$. It is worth noting that all the above graph operations (random walk with restart, subgraph induction, and anonymization) are available in the DGL package (Wang et al., 2019b).
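The three augmentation steps can be sketched in plain Python as follows. This is an illustration only and does not use the DGL routines mentioned above: the function name `rwr_augment`, the fixed number of walk steps, the unweighted adjacency lists, and the 0-based relabeling are all assumptions of the example.

```python
import random


def rwr_augment(adj, ego, restart_prob, num_steps, rng):
    """Sample one anonymized subgraph around `ego`.

    adj: dict vertex -> list of neighbours (undirected, unit edge weights).
    Returns (number of vertices, sorted edge list, new label of the ego).
    """
    # Step 1: random walk with restart, recording every visited vertex.
    visited = {ego}
    current = ego
    for _ in range(num_steps):
        if rng.random() < restart_prob or not adj[current]:
            current = ego
        else:
            # Uniform choice equals "proportional to edge weight" for unit weights.
            current = rng.choice(adj[current])
        visited.add(current)
    # Step 3 (prepared first): relabel vertices to 0..n-1 in arbitrary order.
    order = sorted(visited)
    rng.shuffle(order)
    relabel = {old: new for new, old in enumerate(order)}
    # Step 2: induce the subgraph on the visited vertices, using the new labels.
    edges = {
        (min(relabel[a], relabel[b]), max(relabel[a], relabel[b]))
        for a in visited
        for b in adj[a]
        if b in visited
    }
    return len(order), sorted(edges), relabel[ego]


# Hypothetical toy graph; two augmentations of the same ego form a similar pair.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
first = rwr_augment(adj, 0, 0.8, 50, rng)
second = rwr_augment(adj, 0, 0.8, 50, rng)
```

Because the labels are shuffled, the two augmentations cannot be matched by vertex identity, only by structure.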

Discussion on graph sampling. In random walk with restart sampling, the restart probability controls the radius of the ego-network (i.e., $r$) on which GCC conducts data augmentation. In this work, we follow Qiu et al. (2018b) to use 0.8 as the restart probability. The proposed GCC framework is flexible to other graph sampling algorithms, such as neighborhood sampling (Hamilton et al., 2017) and forest fire (Leskovec and Faloutsos, 2006).

Discussion on anonymization. Now we discuss the intuition behind the anonymization step in the above data augmentation procedure. This step is designed to keep the underlying structural patterns and hide the exact vertex indices. This design avoids learning a trivial solution to instance discrimination, i.e., simply checking whether vertex indices of two sampled graphs match. Moreover, it facilitates the transfer of the learned model across different graphs as such a model is not associated with a particular vertex set.

Q3: Define graph encoders. Given two sampled subgraphs $x^q$ and $x^k$, GCC encodes them via two graph neural network encoders $f_q$ and $f_k$, respectively. Technically, any graph neural network (Battaglia et al., 2018) can be used here as the encoder, and the GCC model is not sensitive to different choices. In practice, we adopt the Graph Isomorphism Network (GIN) (Xu et al., 2019), a state-of-the-art graph neural network model, as our graph encoder. Recall that we focus on structural representation pre-training while most GNN models require vertex features/attributes as input. To bridge the gap, we propose to leverage the graph structure of each sampled subgraph to initialize vertex features. Specifically, we define the generalized positional embedding as follows:

Definition 3.2 (Generalized positional embedding). For each subgraph, its generalized positional embedding is defined to be the top eigenvectors of its normalized graph Laplacian. Formally, suppose one subgraph has adjacency matrix $A$ and degree matrix $D$; we conduct eigen-decomposition on its normalized graph Laplacian s.t. $I - D^{-1/2} A D^{-1/2} = U \Lambda U^\top$, where the top eigenvectors in $U$ (Von Luxburg, 2007) are defined as generalized positional embedding.
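The eigen-decomposition in this definition can be sketched with NumPy as below. The function name and the handling of isolated vertices are assumptions of the example. The text does not say which end of the spectrum "top" refers to, so this sketch assumes the eigenvectors with the largest eigenvalues; a different reading would only change the column selection.

```python
import numpy as np


def generalized_positional_embedding(adjacency, k):
    """Return k eigenvectors of the normalized Laplacian I - D^{-1/2} A D^{-1/2}.

    adjacency: symmetric (n, n) array-like. The result has shape (n, min(k, n)).
    """
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    # Assumption: isolated vertices get a zero scaling factor instead of 1/sqrt(0).
    inv_sqrt = np.zeros_like(degrees)
    nonzero = degrees > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degrees[nonzero])
    laplacian = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(laplacian)
    # Assumption: "top" means largest eigenvalues, so reverse the column order.
    return eigenvectors[:, ::-1][:, :k]


# Hypothetical toy subgraph: a triangle.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
embedding = generalized_positional_embedding(triangle, 2)
```

Each row of `embedding` is the positional feature of one vertex. Eigenvectors are only defined up to sign, so two runs on relabeled copies of the same subgraph may differ by a sign flip per column.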

The generalized positional embedding is inspired by the Transformer model in NLP (Vaswani et al., 2017), where the sine and cosine functions of different frequencies are used to define the positional embeddings in word sequences. Such a definition is deeply connected with graph Laplacian as follows.

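Fact. The Laplacian of path graph has eigenvectors: $u_k(i) = \cos(\pi k i / n - \pi k / (2n))$, for $0 \le k \le n-1$, $1 \le i \le n$. Here $n$ is the number of vertices in the path graph, and $u_k(i)$ is the entry at the $i$-th row and $k$-th column of $U$, i.e., $U = [u_0 \cdots u_{n-1}]$.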
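The fact can be checked numerically. The sketch below builds the unnormalized Laplacian $L = D - A$ of a path on $n$ vertices and verifies that each cosine vector $u_k(i) = \cos(\pi k i / n - \pi k / (2n))$, for $k = 0, \dots, n-1$, satisfies $L u_k = \lambda_k u_k$ with $\lambda_k = 2 - 2\cos(\pi k / n)$. The eigenvalue formula is a standard result that is not stated in the text, and $n = 6$ is an arbitrary choice for the example.

```python
import numpy as np

n = 6  # arbitrary path length for the check

# Unnormalized Laplacian L = D - A of a path graph on n vertices.
A = np.zeros((n, n))
for j in range(n - 1):
    A[j, j + 1] = A[j + 1, j] = 1.0
L = np.diag(A.sum(axis=1)) - A

i = np.arange(1, n + 1)  # vertex positions 1..n
max_residual = 0.0
for k in range(n):
    u = np.cos(np.pi * k * i / n - np.pi * k / (2 * n))
    lam = 2.0 - 2.0 * np.cos(np.pi * k / n)
    # Residual of the eigenvector equation L u = lam * u.
    max_residual = max(max_residual, float(np.abs(L @ u - lam * u).max()))
```

A residual at the level of floating-point round-off confirms that the cosine positional patterns are Laplacian eigenvectors of the path graph.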

The above fact shows that the positional embedding in sequence models can be viewed as Laplacian eigenvectors of path graphs. This inspires us to generalize the positional embedding from path graphs to arbitrary graphs. The reason for using the normalized graph Laplacian rather than the unnormalized version is that path graph is a regular graph (i.e., with constant degrees) while real-world graphs are often irregular and have skewed degree distributions. In addition to the generalized positional embedding, we also add the one-hot encoding of vertex degrees (Xu et al., 2019) and the binary indicator of the ego vertex (Qiu et al., 2018b) as vertex features. After being encoded by the graph encoder, the final $d$-dimensional output vectors are then normalized by their L2-norm (He et al., 2020).
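A minimal sketch of how the three kinds of vertex features could be concatenated, and of the L2 normalization applied to encoder outputs, is given below. The function names, the degree cap `max_degree`, and the feature ordering are assumptions made for illustration; the text does not specify them.

```python
import math


def vertex_features(pos_emb_row, degree, is_ego, max_degree):
    """Concatenate positional embedding, one-hot degree, and ego indicator."""
    one_hot = [0.0] * (max_degree + 1)
    # Assumption: degrees above max_degree share the last one-hot slot.
    one_hot[min(degree, max_degree)] = 1.0
    return list(pos_emb_row) + one_hot + [1.0 if is_ego else 0.0]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Hypothetical values for one vertex and one encoder output.
features = vertex_features([0.5, -0.5], degree=2, is_ego=True, max_degree=3)
unit = l2_normalize([3.0, 4.0])
```

Normalizing the outputs to unit length makes the dot products in the contrastive loss equal to cosine similarities.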

A running example. We illustrate a running example of GCC in Figure 2. For simplicity, we set the dictionary size to be 3, i.e., $K = 2$. GCC first randomly augments two subgraphs $x^q$ and $x^{k_0}$ from a 2-ego network on the left panel of Figure 2. Meanwhile, another two subgraphs, $x^{k_1}$ and $x^{k_2}$, are generated from a noise distribution; in this example, they are randomly augmented from another 2-ego network on the left panel of Figure 2. Then the two graph encoders, $f_q$ and $f_k$, map the query and the three keys to low-dimensional vectors, $q$ and $\{k_0, k_1, k_2\}$. Finally, the contrastive loss in Eq. 1 encourages the model to recognize $(x^q, x^{k_0})$ as a similar instance pair and distinguish them from dissimilar instances, i.e., $\{x^{k_1}, x^{k_2}\}$.

3.4. GCC Learning

In contrastive learning, it is required to maintain the $K$-size dictionary and encoders. Ideally, in Eq. 1, the dictionary should cover as many instances as possible, making $K$ extremely large. However, due to computational constraints, we usually design and adopt economical strategies to effectively build and maintain the dictionary, such as end-to-end (E2E) and momentum contrast (MoCo) (He et al., 2020). We discuss the two strategies as follows.

E2E samples mini-batches of instances and considers samples in the same mini-batch as the dictionary. The objective in Eq. 1 is then optimized with respect to parameters of both $f_q$ and $f_k$, both of which can accept gradient updates by backpropagation consistently. The main drawback of E2E is that the dictionary size is constrained by the batch size.
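The E2E strategy can be illustrated by computing the loss for a whole mini-batch at once, where row $i$ of the query matrix is matched with row $i$ of the key matrix and all other keys in the batch act as negatives. This sketch covers only the loss computation, not the gradient updates; the function name and toy inputs are assumptions of the example.

```python
import numpy as np


def e2e_batch_loss(queries, keys, tau):
    """Mean InfoNCE loss over a batch whose dictionary is the batch itself.

    queries, keys: (B, d) arrays; row i of keys is the positive for row i
    of queries, and the remaining B - 1 rows serve as negatives.
    """
    q = np.asarray(queries, dtype=float)
    k = np.asarray(keys, dtype=float)
    logits = q @ k.T / tau  # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_prob)))


# Hypothetical toy batch: perfectly aligned pairs versus shifted pairs.
batch = np.eye(3)
aligned_loss = e2e_batch_loss(batch, batch, tau=0.5)
shifted_loss = e2e_batch_loss(batch, np.roll(batch, 1, axis=0), tau=0.5)
```

The similarity matrix has one row per query and one column per key, which makes the constraint explicit: the number of negatives per query is the batch size minus one.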

MoCo is designed to increase the dictionary size without additional backpropagation costs. Concretely, MoCo maintains a queue of samples from preceding mini-batches. During optimization, MoCo only updates the parameters of $f_q$ (denoted by $\theta_q$) by backpropagation. The parameters of $f_k$ (denoted by $\theta_k$) are not updated by gradient descent. He et al. (2020) propose a momentum-based update rule for $\theta_k$. More formally, MoCo updates $\theta_k$ by $\theta_k \leftarrow m \theta_k + (1 - m) \theta_q$, where $m \in [0, 1)$ is a momentum hyper-parameter. The above momentum update rule gradually propagates the update in $\theta_q$ to $\theta_k$, making $\theta_k$ evolve smoothly and consistently. In summary, MoCo achieves a larger dictionary size at the expense of dictionary consistency, i.e., the key representations in the dictionary are encoded by a smoothly-varying key encoder.
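The two ingredients of MoCo described above, the momentum update of the key encoder and the queue of keys from preceding mini-batches, can be sketched as follows. Parameters are represented as flat lists of floats, and the names `momentum_update` and `KeyQueue` are assumptions made for this illustration rather than part of any library.

```python
from collections import deque


def momentum_update(theta_k, theta_q, m):
    """Move key-encoder parameters a small step toward the query encoder's."""
    return [m * wk + (1.0 - m) * wq for wk, wq in zip(theta_k, theta_q)]


class KeyQueue:
    """Fixed-size dictionary of encoded keys from preceding mini-batches."""

    def __init__(self, max_size):
        self._keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        # When the queue is full, the oldest keys are dropped automatically.
        self._keys.extend(batch_keys)

    def keys(self):
        return list(self._keys)


# Hypothetical toy values.
theta_k = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9)
queue = KeyQueue(max_size=3)
queue.enqueue(["k1", "k2"])
queue.enqueue(["k3", "k4"])
```

With a momentum close to 1, the key encoder changes slowly, so keys enqueued several steps ago remain roughly comparable with freshly encoded ones.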

In addition to E2E and MoCo, there are other contrastive learning mechanisms to maintain the dictionary, such as memory bank (Wu et al., 2018). Recently, He et al. (2020) show that MoCo is a more effective option than memory bank in computer vision tasks. Therefore, we mainly focus on E2E and MoCo for GCC.

3.5. GCC Fine-Tuning

Downstream tasks. Downstream tasks in graph learning generally fall into two categories, graph-level and node-level, where the target is to predict labels of graphs or nodes, respectively. For graph-level tasks, the input graph itself can be encoded by GCC to obtain the representation. For node-level tasks, the node representation can be defined by encoding its $r$-ego network (or subgraphs augmented from its $r$-ego network). In either case, the encoded representations are then fed into downstream tasks to predict task-specific outputs.

Freezing vs. full fine-tuning. GCC offers two fine-tuning strategies for downstream tasks: freezing mode and full fine-tuning mode. In the freezing mode, we freeze the parameters of the pre-trained graph encoder $f_q$ and treat it as a static feature extractor; the classifiers catering for specific downstream tasks are then trained on top of the extracted features. In the full fine-tuning mode, the graph encoder $f_q$ is trained end-to-end together with the classifier on a downstream task. More implementation details about fine-tuning are available in Section 4.2.

GCC as a local algorithm. As a graph algorithm, GCC belongs to the local algorithm category (Spielman and Teng, 2013; Teng and others, 2016), in which the algorithms only involve local explorations of the input (large-scale) network, since GCC explores local structures by random walk based graph sampling. This property enables GCC to scale to large-scale graph learning tasks and to be friendly to the distributed computing setting.

4. Experiments

In this section, we evaluate GCC on three graph learning tasks (node classification, graph classification, and similarity search), which have been commonly used to benchmark graph learning algorithms (Yanardag and Vishwanathan, 2015; Ribeiro et al., 2017; Donnat et al., 2018; Xu et al., 2019; Sun et al., 2019). We first introduce the self-supervised pre-training settings in Section 4.1, and then report GCC transfer learning results on three graph learning tasks in Section 4.2.

4.1. Pre-Training

Dataset    Academia   DBLP (SNAP)   DBLP (NetRep)   IMDB        Facebook     LiveJournal
Vertices   137,969    317,080       540,486         896,305     3,097,165    4,843,953
Edges      739,384    2,099,732     30,491,458      7,564,894   47,334,788   85,691,368
Table 1. Datasets for pre-training, sorted by number of vertices.

Datasets. Our self-supervised pre-training is performed on six graph datasets, which can be categorized into two groups — academic graphs and social graphs. As for academic graphs, we collect the Academia dataset from NetRep (Ritchie et al., 2016) as well as two DBLP datasets from SNAP (Yang and Leskovec, 2015) and NetRep (Ritchie et al., 2016), respectively. As for social graphs, we collect Facebook and IMDB datasets from NetRep (Ritchie et al., 2016), as well as a LiveJournal dataset from SNAP (Backstrom et al., 2006). Table 1 presents the detailed statistics of datasets for pre-training.

Pre-training settings. We train for 75,000 steps and use Adam (Kingma and Ba, 2015) for optimization with a learning rate of 0.005 and weight decay of 1e-4, with learning rate warmup over the initial steps followed by linear decay of the learning rate. Gradient norm clipping is applied. MoCo is configured by its mini-batch size, dictionary size, and momentum, and E2E by its mini-batch size; both use the same temperature. For both MoCo and E2E, we adopt GIN (Xu et al., 2019) with 5 layers and 64 hidden units per layer as our encoders. The specific values of these hyper-parameters can be found in Table 6 in the Appendix.

4.2. Downstream Task Evaluation

In this section, we apply GCC to three graph learning tasks including node classification, graph classification, and similarity search. As prerequisites, we discuss the two fine-tuning strategies of GCC as well as the baselines we compare with.

Fine-tuning. As we discussed in Section 3.5, we adopt two fine-tuning strategies for GCC. We select logistic regression or SVM from the scikit-learn (Pedregosa et al., 2011) package as the linear classifier for the freezing strategy. (In node classification tasks, we follow Struc2vec and use logistic regression. For graph classification tasks, we follow DGK (Yanardag and Vishwanathan, 2015) and GIN (Xu et al., 2019) and use SVM (Chang and Lin, 2011).) As for the full fine-tuning strategy, we use the Adam optimizer with learning rate 0.005, learning rate warmup over the first 3 epochs, and linear learning rate decay after 3 epochs.

Baselines. Baselines fall into two categories. In the first category, the baseline models learn vertex/graph representations from unlabeled graph data and then feed them into logistic regression or SVM. Examples include DGK (Yanardag and Vishwanathan, 2015), Struc2vec (Ribeiro et al., 2017), GraphWave (Donnat et al., 2018), graph2vec (Narayanan et al., 2017) and InfoGraph (Sun et al., 2019). GCC with the freezing setting belongs to this category. In the second category, the models are optimized in an end-to-end supervised manner. Examples include DGCNN (Zhang et al., 2018) and GIN (Xu et al., 2019). GCC with the full fine-tuning setting belongs to this category. For a fair comparison, we fix the representation dimension of all models to be 64 except graph2vec and InfoGraph. (We allow them to use the preferred dimension size in their papers: graph2vec uses 1024 and InfoGraph uses 512.) The details of baselines will be discussed later.

4.2.1. Node Classification

Setup. The node classification task is to predict unknown node labels in a partially labeled network. To evaluate GCC, we sample a subgraph centered at each vertex and apply GCC on it. The obtained representation is then fed into an output layer to predict the node label. As for datasets, we adopt US-Airport (Ribeiro et al., 2017) and H-index (Zhang et al., 2019a). US-Airport consists of the airline activity data among 1,190 airports. The 4 classes indicate different activity levels of the airports. H-index is a co-authorship graph extracted from OAG (Zhang et al., 2019a). The labels indicate whether the h-index of the author is above or below the median.

Experimental results. We compare GCC with ProNE (Zhang et al., 2019b), GraphWave (Donnat et al., 2018), and Struc2vec (Ribeiro et al., 2017). Table 2 presents the results. It is worth noting that, under the freezing setting, the graph encoder in GCC is not trained on either the US-Airport or the H-index dataset, which the other baselines use as training data. This places GCC at a disadvantage. However, GCC (MoCo, freeze) performs competitively with Struc2vec on US-Airport, and outperforms all baselines on H-index, where Struc2vec cannot finish in one day. Moreover, GCC can be further boosted by full fine-tuning on the target US-Airport or H-index domain.

Datasets              US-Airport   H-index
# nodes               1,190        5,000
# edges               13,599       44,020
ProNE                 62.3         69.1
GraphWave             60.2         70.3
Struc2vec             66.2         > 1 day
GCC (E2E, freeze)     64.8         78.3
GCC (MoCo, freeze)    65.6         75.2
GCC (rand, full)      64.2         76.9
GCC (E2E, full)       68.3         80.5
GCC (MoCo, full)      67.2         80.6
Table 2. Node classification.

4.2.2. Graph Classification

  Datasets              IMDB-B   IMDB-M   COLLAB   RDT-B   RDT-M
  # graphs              1,000    1,500    5,000    2,000   5,000
  # classes             2        3        3        2       5
  Avg. # nodes          19.8     13.0     74.5     429.6   508.5
  DGK                   67.0     44.6     73.1     78.0    41.3
  graph2vec             71.1     50.4     –        75.8    47.9
  InfoGraph             73.0     49.7     –        82.5    53.5
  GCC (E2E, freeze)     71.7     49.3     74.7     87.5    52.6
  GCC (MoCo, freeze)    72.0     49.4     78.9     89.8    53.7
  DGCNN                 70.0     47.8     73.7     –       –
  GIN                   75.6     51.5     80.2     89.4    54.5
  GCC (rand, full)      75.6     50.9     79.4     87.8    52.1
  GCC (E2E, full)       70.8     48.5     79.0     86.4    47.4
  GCC (MoCo, full)      73.8     50.3     81.1     87.6    53.0
Table 3. Graph classification.

Setup. We use five datasets from Yanardag and Vishwanathan (2015): COLLAB, IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY and REDDIT-MULTI5K, which are widely benchmarked in recent graph classification models (Hu et al., 2019a; Sun et al., 2019; Zhang et al., 2018). Each dataset is a set of graphs where each graph is associated with a label. To evaluate GCC on this task, we use raw input graphs as the input of GCC. The encoded graph-level representation is then fed into a classification layer to predict the label of the graph. We compare GCC with several recently developed graph classification models, including Deep Graph Kernel (DGK) (Yanardag and Vishwanathan, 2015), graph2vec (Narayanan et al., 2017), InfoGraph (Sun et al., 2019), DGCNN (Zhang et al., 2018) and GIN (Xu et al., 2019). Among these baselines, DGK, graph2vec and InfoGraph belong to the first category, while DGCNN and GIN belong to the second category.

Experimental Results. Table 3 shows the comparison. In the first category, GCC (MoCo, freeze) performs competitively with InfoGraph on IMDB-B and IMDB-M, while achieving the best performance on the other datasets. Again, we want to emphasize that DGK, graph2vec and InfoGraph all need to be pre-trained on target domain graphs, but GCC only relies on the graphs listed in Table 1 for pre-training. In the second category, we compare GCC with DGCNN and GIN. GCC achieves better performance than DGCNN and comparable performance to GIN. GIN is a recently proposed SOTA model for graph classification. We follow the instructions in the paper (Xu et al., 2019) to train GIN and report the detailed results in Table 7 in the Appendix. We can see that, on each dataset, the best performance of GIN is achieved by different hyper-parameters, and that GIN's performance can be sensitive to the choice of hyper-parameters. In contrast, GCC shares the same pre-training/fine-tuning hyper-parameters on all datasets, showing its robustness on graph classification.

Pair              KDD-ICDM            SIGIR-CIKM          SIGMOD-ICDE
Graph             KDD      ICDM       SIGIR    CIKM       SIGMOD   ICDE
# nodes           2,867    2,607      2,851    3,548      2,616    2,559
# edges           7,637    4,774      6,354    7,076      8,304    6,668
# ground truth    697                 874                 898
k                 20       40         20       40         20       40
Random            0.0198   0.0566     0.0223   0.0447     0.0221   0.0521
RolX              0.0779   0.1288     0.0548   0.0984     0.0776   0.1309
Panther++         0.0892   0.1558     0.0782   0.1185     0.0921   0.1320
GraphWave         0.0846   0.1693     0.0549   0.0995     0.0947   0.1470
GCC (E2E)         0.1047   0.1564     0.0549   0.1247     0.0835   0.1336
GCC (MoCo)        0.0904   0.1521     0.0652   0.1178     0.0846   0.1425
Table 4. Top-k similarity search (k = 20, 40).

4.2.3. Top-k Similarity Search

Setup. We adopt the co-author dataset from Zhang et al. (2015), which consists of the conference co-author graphs of KDD, ICDM, SIGIR, CIKM, SIGMOD, and ICDE. The problem of top-k similarity search is defined as follows. Given two graphs G1 and G2, for example the KDD and ICDM co-author graphs, we want to find the most similar vertex v from G1 for each vertex u in G2. In this dataset, the ground truth is defined to be the authors who publish in both conferences. Note that similarity search is an unsupervised task, so we evaluate GCC without fine-tuning. Specifically, we first extract two subgraphs centered at u and v by random walk with restart graph sampling. After encoding them by GCC, we measure the similarity score between u and v as the inner product of their representations. Finally, by sorting the above scores, we use HITS@10 (top-10 accuracy) to measure the performance of different methods. We compare GCC with RolX (Henderson et al., 2012), Panther++ (Zhang et al., 2015) and GraphWave (Donnat et al., 2018). We also provide random guess results for reference.
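The scoring and ranking step can be illustrated with a short sketch. It assumes that each vertex has already been encoded into a fixed-length vector; the names `top_k` and `hits_at_k`, the dictionary-based inputs, and the toy vectors are all hypothetical and chosen only to show the inner-product ranking and the hit-rate computation.

```python
def inner_product(x, y):
    """Similarity score between two representation vectors."""
    return sum(a * b for a, b in zip(x, y))

def top_k(query_vec, candidates, k):
    """Return the ids of the k candidates with the largest inner product."""
    ranked = sorted(
        candidates,
        key=lambda cid: inner_product(query_vec, candidates[cid]),
        reverse=True,
    )
    return ranked[:k]

def hits_at_k(queries, candidates, ground_truth, k):
    """Fraction of ground-truth queries whose true match is in the top-k list."""
    hits = sum(
        1
        for q, target in ground_truth.items()
        if target in top_k(queries[q], candidates, k)
    )
    return hits / len(ground_truth)

# Toy embeddings (made up for illustration, not taken from the experiments).
queries = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
candidates = {"v1": [0.9, 0.1], "v2": [0.1, 0.9], "v3": [0.5, 0.5]}
ground_truth = {"u1": "v1", "u2": "v3"}

score_at_1 = hits_at_k(queries, candidates, ground_truth, k=1)  # 0.5
score_at_2 = hits_at_k(queries, candidates, ground_truth, k=2)  # 1.0
```

Sorting all candidates for every query is quadratic in the number of vertices, which is acceptable for graphs of a few thousand nodes such as those in Table 4.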

Experimental Results. Table 4 presents the performance of different methods on the top-k similarity search task in three co-author networks. We can see that, compared with Panther++ (Zhang et al., 2015) and GraphWave (Donnat et al., 2018), which are trained in place on the co-author graphs, simply applying pre-trained GCC can be competitive.

Overall, we show that a graph neural network encoder pre-trained on several popular graph datasets can be directly adapted to new graph datasets and unseen graph learning tasks. More importantly, compared with models trained from scratch, the reused model achieves competitive and sometimes better performance. This demonstrates the transferability of graph structural patterns and the effectiveness of our GCC framework in capturing these patterns.

4.3. Ablation Studies

Effect of pre-training. It is still not clear whether GCC's good performance is due to pre-training or to the expressive power of its GIN (Xu et al., 2019) encoder. To answer this question, we fully fine-tune GCC with its GIN encoder randomly initialized, which is equivalent to training a GIN encoder from scratch. We name this model GCC (rand), as shown in Table 2 and Table 3. In all datasets except IMDB-B, GCC (MoCo) outperforms its randomly initialized counterpart, suggesting that pre-training generally provides a better starting point for fine-tuning than random initialization. For IMDB-B, we attribute the gap to the domain shift between pre-training data and downstream tasks.

Contrastive loss mechanisms. The common belief is that MoCo has stronger expressive power than E2E (He et al., 2020), and that a larger dictionary size always helps. We also observe such trends, as shown in Figure 3. However, the effect of a large dictionary size is not as significant as reported in computer vision tasks (He et al., 2020). For example, MoCo with a larger dictionary outperforms MoCo with a smaller one by only small margins in terms of accuracy: 1.0 absolute gain on US-Airport and 0.8 absolute gain on COLLAB. On the other hand, training MoCo is much more economical than training E2E. E2E (K=1024) takes 5 days and 16 hours, while MoCo (K=16384) only needs 9 hours. Detailed training time can be found in Figure 5 in the Appendix.
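To show where the dictionary size K and the temperature enter, the following is a minimal sketch of an InfoNCE-style contrastive loss for a single query, assuming the standard form used by MoCo (He et al., 2020): one positive key is scored against K negative keys drawn from the dictionary. The function name `info_nce` and the list-based vectors are hypothetical; only the temperature of 0.07 comes from Table 6.

```python
import math

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Negative log-probability of picking the positive key among all keys."""
    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    # Logit 0 belongs to the positive key, the rest to the K negatives.
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, key) / temperature for key in negative_keys]

    # Numerically stable log-sum-exp over all K + 1 logits.
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(l - largest) for l in logits))
    return log_sum - logits[0]

q = [1.0, 0.0]
loss_one_negative = info_nce(q, [1.0, 0.0], [[0.0, 1.0]])
loss_two_negatives = info_nce(q, [1.0, 0.0], [[0.0, 1.0], [0.6, 0.8]])
```

In this form, E2E and MoCo differ only in where `negative_keys` come from: the other samples of the current mini-batch for E2E, or a queue of previously encoded keys for MoCo, which is why MoCo can use a large K with a small batch.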

Figure 3. Comparison of contrastive loss mechanisms. Panel (a): US-Airport.
momentum 0 0.9 0.99 0.999 0.9999
US-Airport 62.3 63.2 63.7 65.6 61.5
COLLAB 76.6 75.1 77.4 78.9 79.4
Table 5. Momentum ablation.

Momentum. As mentioned in MoCo (He et al., 2020), momentum plays a subtle role in learning high-quality representations. Table 5 shows accuracy with different momentum values on the US-Airport and COLLAB datasets. For US-Airport, the best performance is reached by m = 0.999, which is the value recommended in He et al. (2020), showing that building a consistent dictionary is important for MoCo. However, on COLLAB, a larger momentum value seems to bring better performance. Moreover, we do not observe the “training loss oscillation” reported in He et al. (2020) when setting m = 0. GCC (MoCo) converges well, but the accuracy is much worse.
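The role of the momentum value can be seen from the key-encoder update rule of MoCo (He et al., 2020), in which the key parameters are an exponential moving average of the query parameters. The sketch below is a plain-Python illustration with parameters stored as flat lists; the function name `momentum_update` is hypothetical, and 0.999 is the momentum listed in Table 6.

```python
def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter slightly toward the query encoder.

    theta_k <- m * theta_k + (1 - m) * theta_q
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

key = [0.0, 2.0]
query = [1.0, 0.0]

slow = momentum_update(key, query, m=0.999)   # key encoder barely moves
copied = momentum_update(key, query, m=0.0)   # key encoder copies the query encoder
frozen = momentum_update(key, query, m=1.0)   # key encoder never changes
```

With m = 0 the key encoder is replaced by the query encoder at every step, so keys already stored in the dictionary were produced by noticeably different encoders; a value close to 1 keeps the stored keys consistent with each other.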

Pre-training datasets. We ablate the number of datasets used for pre-training. To avoid enumerating a combinatorial space, we pre-train with the first several datasets in Table 1, and report the 10-fold validation accuracy scores on US-Airport and COLLAB, respectively. For example, when using one dataset for pre-training, we select Academia; when using two, we choose Academia and DBLP (SNAP); and so on. We present ordinary least squares (OLS) estimates of the relationship between the number of datasets and the model performance. As shown in Figure 4, we can observe a trend towards higher accuracy when using more datasets for pre-training. On average, adding one more dataset leads to 0.43 and 0.81 accuracy (%) gain on US-Airport and COLLAB, respectively. (The effect on US-Airport is positive but statistically insignificant, while the effect on COLLAB is positive and significant.)
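For readers unfamiliar with the OLS estimate, the slope reported above is the coefficient of a simple linear regression of accuracy on the number of pre-training datasets. The sketch below computes it in closed form; the function name `ols_fit` is hypothetical, and the data points are synthetic values placed exactly on a line for illustration, not the measurements behind Figure 4.

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic example: accuracy grows by exactly 0.5 per added dataset.
num_datasets = [1, 2, 3, 4, 5, 6]
accuracy = [70.5, 71.0, 71.5, 72.0, 72.5, 73.0]
slope, intercept = ols_fit(num_datasets, accuracy)
```

The slope is read as the average accuracy gain per additional pre-training dataset; a significance test on that slope additionally requires its standard error, which this sketch does not compute.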

Figure 4. Ablation study on pre-training datasets. Panel (a): US-Airport.

5. Conclusion

In this work, we study graph representation learning with the goal of characterizing and transferring structural features in social and information networks. We present Graph Contrastive Coding (GCC), a graph-based contrastive learning framework to learn structural representations and similarity from data. The pre-trained model achieves performance competitive with its supervised trained-from-scratch counterparts in three graph learning tasks on ten graph datasets. In the future, we plan to benchmark more graph learning tasks and more graph datasets. We would also like to explore applications of GCC on graphs in other domains, such as protein-protein association networks (Szklarczyk et al., 2016).

Acknowledgements. The work is supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013).


References

  • R. Albert and A. Barabási (2002) Statistical mechanics of complex networks. Reviews of modern physics 74 (1), pp. 47. Cited by: §1.
  • J. I. Alvarez-Hamelin, L. Dall’Asta, A. Barrat, and A. Vespignani (2006) Large scale networks fingerprinting and visualization using the k-core decomposition. In Advances in neural information processing systems, pp. 41–50. Cited by: §2.1.
  • L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan (2006) Group formation in large social networks: membership, growth, and evolution. In KDD ’06, pp. 44–54. Cited by: §4.1.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §1, §3.3, footnote 3.
  • A. R. Benson, D. F. Gleich, and J. Leskovec (2016) Higher-order organization of complex networks. Science 353 (6295), pp. 163–166. Cited by: §2.1, §3.1.
  • S. P. Borgatti and M. G. Everett (2000) Models of core/periphery structures. Social networks 21 (4), pp. 375–395. Cited by: §1.
  • R. S. Burt (2009) Structural holes: the social structure of competition. Harvard university press. Cited by: §2.1, §3.1.
  • C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2 (3), pp. 1–27. Cited by: footnote 4.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2019) ELECTRA: pre-training text encoders as discriminators rather than generators. In ICLR ’19, Cited by: §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT ’19, pp. 4171–4186. Cited by: §1, §2.3.
  • Y. Dong, N. V. Chawla, and A. Swami (2017) Metapath2vec: scalable representation learning for heterogeneous networks. In KDD ’17, pp. 135–144. Cited by: §1, §2.3.
  • C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec (2018) Learning structural node embeddings via diffusion wavelets. In KDD ’18, pp. 1320–1329. Cited by: §A.2.1, §A.2.3, §1, §2.1, §4.2.1, §4.2.3, §4.2.3, §4.2, §4.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML ’17, pp. 1263–1272. Cited by: §2.1.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In KDD ’16, pp. 855–864. Cited by: §1, §2.1, §2.3, §3.1.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In CVPR ’06, Vol. 2, pp. 1735–1742. Cited by: §2.2.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §2.1, §3.3.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR ’20, pp. 9729–9738. Cited by: §1, §2.2, §3.2, §3.3, §3.3, §3.4, §3.4, §3.4, §4.3, §4.3.
  • K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, C. Faloutsos, and L. Li (2012) Rolx: structural role extraction & mining in large graphs. In KDD ’12, pp. 1231–1239. Cited by: §A.2.3, §2.1, §4.2.3.
  • W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec (2019a) Pre-training graph neural networks. In ICLR ’19, Cited by: §1, §2.3, §3.1, §4.2.2.
  • Z. Hu, C. Fan, T. Chen, K. Chang, and Y. Sun (2019b) Unsupervised pre-training of graph convolutional networks. ICLR 2019 Workshop: Representation Learning on Graphs and Manifolds. Cited by: §1, §2.3.
  • G. Jeh and J. Widom (2002) SimRank: a measure of structural-context similarity. In KDD ’02, pp. 538–543. Cited by: §2.1.
  • Y. Jin, G. Song, and C. Shi (2019) GraLSP: graph neural networks with local structural patterns. arXiv preprint arXiv:1911.07675. Cited by: §3.3.
  • K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann (2016) Benchmark data sets for graph kernels. External Links: Link Cited by: §A.3.2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. Cited by: §4.1.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR ’17, Cited by: §1, §1, §2.1, §2.3.
  • E. A. Leicht, P. Holme, and M. E. Newman (2006) Vertex similarity in networks. Physical Review E 73 (2), pp. 026120. Cited by: §2.1.
  • J. Leskovec and C. Faloutsos (2006) Sampling from large graphs. In KDD ’06, pp. 631–636. Cited by: §3.3, §3.3.
  • J. Leskovec, J. Kleinberg, and C. Faloutsos (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD ’05, pp. 177–187. Cited by: §1.
  • S. Micali and Z. A. Zhu (2016) Reconstructing markov processes from independent and anonymous experiments. Discrete Applied Mathematics 200, pp. 108–122. Cited by: §3.3.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1, §2.2, §2.3, §3.2.
  • R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr, I. Ayzenshtat, M. Sheffer, and U. Alon (2004) Superfamilies of evolved and designed networks. Science 303 (5663), pp. 1538–1542. Cited by: §1.
  • R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon (2002) Network motifs: simple building blocks of complex networks. Science 298 (5594), pp. 824–827. Cited by: §2.1, §3.1.
  • A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017) Graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: §A.2.2, §4.2.2, §4.2.
  • M. E. Newman (2006) Modularity and community structure in networks. Proceedings of the national academy of sciences 103 (23), pp. 8577–8582. Cited by: §1.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.2, §3.2.
  • J. Pan, H. Yang, C. Faloutsos, and P. Duygulu (2004) Automatic multimedia cross-modal correlation discovery. In KDD ’04, pp. 653–658. Cited by: §2.1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §A.1.2.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: §4.2.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In KDD ’14, pp. 701–710. Cited by: §1, §1, §2.1, §2.3, §3.1.
  • J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, K. Wang, and J. Tang (2019) Netsmf: large-scale network embedding as sparse matrix factorization. In The World Wide Web Conference, pp. 1509–1520. Cited by: §1.
  • J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang (2018a) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM ’18, pp. 459–467. Cited by: §1.
  • J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang (2018b) Deepinf: social influence prediction with deep learning. In KDD ’18, pp. 2110–2119. Cited by: §3.3, §3.3.
  • L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo (2017) Struc2vec: learning node representations from structural identity. In KDD ’17, pp. 385–394. Cited by: §A.2.1, §A.3.1, §1, §2.1, §4.2.1, §4.2.1, §4.2, §4.
  • S. C. Ritchie, S. Watts, L. G. Fearnley, K. E. Holt, G. Abraham, and M. Inouye (2016) A scalable permutation approach reveals replication and preservation patterns of network modules in large datasets. Cell systems 3 (1), pp. 71–82. Cited by: §4.1.
  • D. A. Spielman and S. Teng (2013) A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM Journal on computing 42 (1), pp. 1–26. Cited by: §3.5.
  • F. Sun, J. Hoffman, V. Verma, and J. Tang (2019) InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In ICLR ’19, Cited by: §A.2.2, §1, §4.2.2, §4.2, §4.
  • D. Szklarczyk, J. H. Morris, H. Cook, M. Kuhn, S. Wyder, M. Simonovic, A. Santos, N. T. Doncheva, A. Roth, P. Bork, et al. (2016) The string database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic acids research, pp. gkw937. Cited by: §5.
  • J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) Line: large-scale information network embedding. In WWW ’15, pp. 1067–1077. Cited by: §1, §2.1, §2.3, §3.1.
  • S. Teng et al. (2016) Scalable algorithms for data and network analysis. Foundations and Trends® in Theoretical Computer Science 12 (1–2), pp. 1–274. Cited by: §3.5.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §2.2.
  • H. Tong, C. Faloutsos, and J. Pan (2006) Fast random walk with restart and its applications. In ICDM ’06, pp. 613–622. Cited by: §3.3.
  • J. Ugander, L. Backstrom, C. Marlow, and J. Kleinberg (2012) Structural diversity in social contagion. Proceedings of the National Academy of Sciences 109 (16), pp. 5962–5966. Cited by: §2.1, §3.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §3.3.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. ICLR ’18. Cited by: §2.1.
  • U. Von Luxburg (2007) A tutorial on spectral clustering. Statistics and computing 17 (4), pp. 395–416. Cited by: Definition 3.2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019a) GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR ’19, Cited by: §1.
  • M. Wang, L. Yu, D. Zheng, Q. Gan, Y. Gai, Z. Ye, M. Li, J. Zhou, Q. Huang, C. Ma, et al. (2019b) Deep graph library: towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315. Cited by: §A.1.2, §3.3.
  • D. J. Watts and S. H. Strogatz (1998) Collective dynamics of small-world networks. nature 393 (6684), pp. 440. Cited by: §1.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR ’18, pp. 3733–3742. Cited by: §1, §2.2, §3.2, §3.4.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In ICLR ’19, Cited by: §A.2.2, §1, §1, §3.3, §3.3, §4.1, §4.2.2, §4.2.2, §4.2, §4.3, §4, footnote 4.
  • P. Yanardag and S. Vishwanathan (2015) Deep graph kernels. In KDD ’15, pp. 1365–1374. Cited by: §A.2.2, §4.2.2, §4.2, §4, footnote 4.
  • J. Yang and J. Leskovec (2015) Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems 42 (1), pp. 181–213. Cited by: §4.1.
  • R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In KDD ’18, pp. 974–983. Cited by: §2.1.
  • F. Zhang, X. Liu, J. Tang, Y. Dong, P. Yao, J. Zhang, X. Gu, Y. Wang, B. Shao, R. Li, et al. (2019a) OAG: toward linking large-scale heterogeneous entity graphs. In KDD ’19, pp. 2585–2595. Cited by: §A.3.1, §4.2.1.
  • J. Zhang, Y. Dong, Y. Wang, J. Tang, and M. Ding (2019b) ProNE: fast and scalable network representation learning. In IJCAI ’19, pp. 4278–4284. Cited by: §A.2.1, §1, §4.2.1.
  • J. Zhang, J. Tang, C. Ma, H. Tong, Y. Jing, and J. Li (2015) Panther: fast top-k similarity search on large networks. In KDD ’15, pp. 1445–1454. Cited by: §A.2.3, §A.3.3, §4.2.3, §4.2.3.
  • M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In AAAI ’18, Cited by: §A.2.2, §4.2.2, §4.2.

Appendix A

Hyper-parameter E2E MoCo
Batch size 1024 32
Restart probability 0.8 0.8
Dictionary size 1023 16384
Temperature 0.07 0.07
Momentum NA 0.999
Warmup steps 7,500 7,500
Weight decay 1e-5 1e-5
Training steps 75,000 75,000
Initial learning rate 0.005 0.005
Learning rate decay Linear Linear
Adam ε 1e-8 1e-8
Adam β1 0.9 0.9
Adam β2 0.999 0.999
Gradient clipping 1.0 1.0
Number of layers 5 5
Hidden units per layer 64 64
Dropout rate 0.5 0.5
Degree embedding size 16 16
Table 6. Pre-training hyper-parameters for E2E and MoCo.
Figure 5. Pre-training time of different contrastive loss mechanisms and dictionary size K.

A.1. Implementation Details

A.1.1. Hardware Configuration

The experiments are conducted on Linux servers equipped with an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 256GB RAM and 8 NVIDIA 2080Ti GPUs.

A.1.2. Software Configuration

All models are implemented in PyTorch (Paszke et al., 2019) version 1.3.1, DGL (Wang et al., 2019b) version 0.4.1 with CUDA version 10.1, scikit-learn version 0.20.3 and Python 3.6. Our code and datasets will be available.

 batch dropout degree IMDB-B IMDB-M COLLAB RDT-B RDT-M
 32 0 no 72.9 48.7 69.2 83.5 52.2
 32 0.5 no 71.6 49.1 66.2 77.5 51.7
 128 0 no 75.6 50.9 73.0 89.4 54.5
 128 0.5 no 74.6 51.0 72.3 81.8 53.5
 32 0 yes 73.9 51.1 79.7 77.0 46.2
 32 0.5 yes 74.5 50.1 79.1 77.4 45.6
 128 0 yes 73.3 51.0 79.8 76.7 45.6
 128 0.5 yes 73.1 51.5 80.2 77.4 45.8
Table 7. Performance of GIN model under various hyper-parameter configurations.

A.1.3. Pre-training

The detailed hyper-parameters are listed in Table 6. Training times of GCC variants are shown in Figure 5. (The reported times are elapsed real time for pre-training, which might be affected by other programs running on the server.) The training time of GCC (E2E) grows sharply with the dictionary size K, while that of GCC (MoCo) remains roughly the same, which indicates that MoCo is more economical and easier to scale to larger dictionary sizes.

A.2. Baselines

A.2.1. Node Classification

GraphWave (Donnat et al., 2018). We download the authors' official source code and keep all the training settings the same. The implementation requires a networkx graph and time points as input. We convert our dataset to the networkx format, and use the automatic selection of the range of scales provided by the authors. We set the output embedding dimension to 64.

Struc2vec (Ribeiro et al., 2017) We download the authors’ official source code and use default hyper-parameters provided by the authors: (1) walk length = 80; (2) number of walks = 10; (3) window size = 10; (4) number of iterations = 5.

The only modifications we make are: (1) number of dimensions = 64; (2) number of workers = 48 to speed up training.

We find the method hard to scale on the H-index dataset even though we set the number of workers to 48, compared to 4 by default. We kept the code running for 24 hours on the H-index dataset and it failed to finish. We observed that the sampling strategy in Struc2vec takes up most of the time, as illustrated in the original paper.

ProNE (Zhang et al., 2019b). We download the authors' official code and keep the hyper-parameters the same: (1) step = 10; (2) θ = 0.5; (3) μ = 0.2. The dimension size is set to 64.

A.2.2. Graph Classification

DGK (Yanardag and Vishwanathan, 2015), graph2vec (Narayanan et al., 2017), InfoGraph (Sun et al., 2019), DGCNN (Zhang et al., 2018) We adopt the reported results in these papers. Our experimental setting is exactly the same except for the dimension size. Note that graph2vec uses 1024 and InfoGraph uses 512 as the dimension size. Following GIN, we use 64.

GIN (Xu et al., 2019). We use the official code released by Xu et al. (2019) and follow exactly the procedure described in their paper. The hyper-parameters tuned for each dataset are: (1) the number of hidden units, chosen from {16, 32} for bioinformatics graphs and fixed to 64 for social graphs; (2) the batch size, chosen from {32, 128}; (3) the dropout ratio after the dense layer, chosen from {0, 0.5}; (4) the number of epochs, i.e., the single epoch with the best cross-validation accuracy averaged over the 10 folds is selected. We report the obtained results in Table 7.

A.2.3. Top-k Similarity Search

Random, RolX (Henderson et al., 2012), Panther++ (Zhang et al., 2015) We obtain the experimental results for these baselines from Zhang et al. (2015).

GraphWave (Donnat et al., 2018). Embeddings computed by the GraphWave method also have the ability to generalize across graphs. The authors evaluated on synthetic graphs in their paper, which are not publicly available. To compare with GraphWave on the co-author datasets, we compute GraphWave embeddings given two graphs and follow the same procedure described in Section 4.2.3 to compute the HITS@10 (top-10 accuracy) score.

A.3. Datasets

A.3.1. Node Classification Datasets

US-Airport (https://github.com/leoribeiro/struc2vec/tree/master/graph). We obtain the US-Airport dataset directly from Ribeiro et al. (2017).

H-index (https://www.openacademic.ai/oag/). We create the H-index dataset, a co-authorship graph extracted from OAG (Zhang et al., 2019a). Since the original OAG co-authorship graph has millions of nodes, it is too large to serve as a node classification benchmark. Therefore, we implemented the following procedure to extract smaller subgraphs from OAG:

  1. Select an initial vertex set V_s in OAG;

  2. Run breadth first search (BFS) from V_s until N nodes are visited;

  3. Return the sub-graph induced by the visited nodes.

We set N = 5,000, and randomly select 20 nodes from the top 200 nodes with the largest degree as the initial vertex set in step (1).
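The extraction procedure above can be sketched in a few lines of Python. This is an illustration on a toy adjacency list, not the script used to build the H-index dataset: the function name `bfs_subgraph`, the parameter names `seeds` and `max_nodes`, and the choice to count a node as visited at the moment it is first discovered are assumptions of this sketch.

```python
from collections import deque

def bfs_subgraph(adj, seeds, max_nodes):
    """Multi-source BFS that stops after `max_nodes` nodes, then returns
    the subgraph induced by the discovered nodes.

    Assumption: a node counts as "visited" as soon as it is discovered.
    """
    seen = set()
    queue = deque()
    for s in seeds:
        if len(seen) >= max_nodes:
            break
        if s not in seen:
            seen.add(s)
            queue.append(s)
    while queue and len(seen) < max_nodes:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
                if len(seen) >= max_nodes:
                    break
    # Induced subgraph: keep only edges between discovered nodes.
    return {u: [v for v in adj[u] if v in seen] for u in seen}

# Toy path graph 0-1-2-3-4-5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
small = bfs_subgraph(adj, seeds=[0], max_nodes=3)
full = bfs_subgraph(adj, seeds=[0], max_nodes=100)
```

Starting from high-degree seeds, as described in the text, makes the BFS frontier grow quickly, so the extracted subgraph tends to be dense around the seeds.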

A.3.2. Graph Classification Datasets


We download COLLAB, IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY and REDDIT-MULTI5K from Benchmark Data Sets for Graph Kernels (Kersting et al., 2016).

A.3.3. Top-k Similarity Search Datasets


We obtain the paired conference co-author datasets, including KDD-ICDM, SIGIR-CIKM, and SIGMOD-ICDE, from Zhang et al. (2015) and make them publicly available with the permission of the original authors.