Graph neural networks (GNNs) have made remarkable advances in representation learning for graph-structured data Kipf and Welling (2016); Gilmer et al. (2017); Veličković et al. (2017). By combining the rich topology of graphs with the expressive power of deep learning, GNNs learn low-dimensional embeddings from variable-size, permutation-invariant graphs. Their success has benefited a wide range of applications, such as social networks Kipf and Welling (2016), molecules Duvenaud et al. (2015), robot design Wang et al. (2019), and knowledge graphs Vivona and Hassani (2019).
Like other deep learning methods, many existing GNNs and their variants are trained in a semi-supervised setting and therefore need a certain amount of labeled data. However, obtaining enough high-quality labels can be challenging in real applications. For instance, biological graphs represent specialized concepts Sun et al. (2019), so annotating them is difficult and expensive; in addition, the reliability of the given labels is sometimes questionable Peng et al. (2020).
Although existing unsupervised graph representation algorithms do not need labels, they rely heavily on negative samples. For example, random walk-based methods Perozzi et al. (2014); Tang et al. (2015) treat node pairs that are "close" in the graph as positive samples and node pairs that are "far" apart as negative samples. The loss function then forces "close" node pairs to have more similar representations than "far" node pairs. In addition, deep graph infomax (DGI) Velickovic et al. (2019) maximizes mutual information between patch representations and the corresponding high-level summaries, taking a corrupted graph as the source of negative samples. The performance of these methods depends heavily on the choice of negative samples, but in the wild negative pairs are hard or computationally expensive to acquire. Consequently, obtaining high-quality graph representations without supervision or negative examples is necessary for a number of practical applications, which motivates this paper.
Recently, BYOL (bootstrap your own latent) Grill et al. (2020) introduced a bootstrapping mechanism to visual representation learning: the model learns from a previous version of itself and achieves state-of-the-art results without negative examples. Nevertheless, BYOL is designed for image data, and to the best of our knowledge, no work has applied the bootstrapping mechanism to graph-structured data. In this work, we extend BYOL to graph-structured data and propose deep graph bootstrapping (DGB). DGB relies on two neural networks: an online network and a target network. During each round of training, the target network is fixed and the online network is updated by gradient descent to predict the target network's output; afterwards, the target network parameters are updated with a slow-moving average of the online network parameters. This training mechanism lets the online and target networks learn from each other, so DGB needs neither labeled data nor negative examples.
Data augmentation plays an important role in DGB, but designing efficient augmentation methods for graph-structured data is still challenging. In this paper, we systematically summarize three kinds of augmentation methods: node augmentation, including node feature dropout and node dropout; adjacency matrix augmentation, including personalized PageRank (PPR) Page et al. (1999) and the heat kernel Kondor and Lafferty (2002); and the combination of the two. We also apply these augmentation methods to DGB and show how they help improve its performance.
In summary, we propose a new unsupervised graph representation learning method that needs no negative examples, and we systematically summarize three augmentation methods for graph-structured data. The main contributions of this paper are as follows:
We are the first to generalize the bootstrapping mechanism to graph-structured data, and we propose DGB, an unsupervised graph representation learning method that needs no negative examples.
The experimental results on the benchmark datasets show that the DGB model is superior to existing supervised and unsupervised graph representation models.
We systematically summarize three kinds of augmentation methods for graph-structured data, apply them to DGB, and analyze how data augmentation affects the performance of DGB.
2 Related Works
2.1 Random Walk Based Methods
Random walk-based methods Perozzi et al. (2014); Tang et al. (2015); Grover and Leskovec (2016) generate random walks across nodes and then apply neural language models to obtain network embeddings. They are known to over-emphasize proximity information at the expense of structural information Velickovic et al. (2019); Hassani and Khasahmadi (2020).
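As a rough illustration of how such methods obtain their training signal, the sketch below samples a uniform random walk and collects the node pairs that co-occur within a small window. The function names, the toy path graph and the window size are assumptions made for this example; the skip-gram style training step and the negative sampling used by the cited methods are not shown.

```python
import random

def random_walk(adj, start, length, rng):
    """Uniform random walk visiting at most `length` nodes, starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

def positive_pairs(walk, window):
    """Pairs of distinct nodes that co-occur within `window` steps in a walk."""
    pairs = []
    for i, u in enumerate(walk):
        for v in walk[i + 1 : i + 1 + window]:
            if u != v:
                pairs.append((u, v))
    return pairs

# Toy undirected path graph 0 - 1 - 2 - 3 - 4, stored as an adjacency list.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
walk = random_walk(adj, start=0, length=6, rng=rng)
pairs = positive_pairs(walk, window=2)
```

Because every pair comes from nodes a few steps apart on a walk, the resulting positives are dominated by proximity, which is consistent with the limitation noted above.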
2.2 Target Networks
Target networks have a wide range of applications in deep reinforcement learning Van Hasselt et al. (2018). As one of the two key components of the deep Q-network Mnih et al. (2015), the target network makes training more stable and alleviates oscillation or divergence. In the deep Q-network, every C updates the network $Q$ is cloned to obtain a target network $\hat{Q}$, and $\hat{Q}$ is used to generate the Q-learning targets for the following C updates. Lillicrap et al. (2015) extend target networks to soft target updates instead of directly copying the weights; as a result, the target values are constrained to change slowly, which improves the stability of learning.
2.3 Graph Diffusion
Built on sparsified generalized graph diffusion, graph diffusion convolution (GDC) is a powerful spatial extension of message passing in GNNs Klicpera et al. (2019). GDC generates a new sparse graph and is not limited to message passing neural networks, so it can be applied to any existing graph-based model or algorithm without changing the model.
2.4 Contrastive Methods
Contrastive methods employ a scoring function that assigns a higher score to positive pairs and a lower score to negative pairs Velickovic et al. (2019); Li et al. (2019). Some recent works adapt contrastive ideas from image representation learning to unsupervised graph learning Velickovic et al. (2019); Sun et al. (2019). Deep graph infomax Velickovic et al. (2019) extends deep InfoMax (DIM) Hjelm et al. (2018) and contrasts node and graph embeddings to learn node embeddings; in addition, InfoGraph extends DIM to learn graph embeddings.
2.5 Bootstrapping Methods
Unlike contrastive methods, which require many negative examples to work well He et al. (2020); Chen et al. (2020), bootstrapping methods can learn representations without negative examples in an unsupervised manner. DeepCluster Caron et al. (2018) produces targets for the next representation by bootstrapping the previous representation; it clusters data points based on the prior representation and uses the cluster index of each example as a classification target to train the new representation. Predictions of Bootstrapped Latents (PBL) Guo et al. (2020) applies bootstrapping to multitask reinforcement learning. PBL predicts latent embeddings of future observations to train its representations, and the latent embeddings are themselves trained to be predictive of the aforementioned representations. BYOL Grill et al. (2020) uses two neural networks, referred to as the online and target networks; the outputs of the target network serve as targets to train the online network, and the parameters of the target network are updated by a slow-moving average of the online network parameters.
3 Unsupervised Graph Representation Learning
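A graph can be represented as $G = (X, A)$, where $X \in \mathbb{R}^{N \times F}$ represents the node features, $N$ is the number of nodes in the input graph and $x_i \in \mathbb{R}^{F}$ is the feature vector of node $i$; $A \in \mathbb{R}^{N \times N}$ is an adjacency matrix, where $A_{ij} = 1$ indicates that there exists an edge from node $i$ to node $j$ and $A_{ij} = 0$ otherwise.
The objective of DGB is to learn an encoder $f: \mathbb{R}^{N \times F} \times \mathbb{R}^{N \times N} \rightarrow \mathbb{R}^{N \times F'}$, so that $H = f(X, A) = \{h_1, \dots, h_N\}$, where $h_i \in \mathbb{R}^{F'}$ is the final learned embedding of node $i$, which is used for downstream tasks such as node classification or node clustering.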
4 Deep Graph Bootstrapping
In this section, we introduce the proposed DGB model in detail. The DGB model is inspired by BYOL Grill et al. (2020), which learns visual representations by bootstrapping the latent representations. The DGB model learns node embeddings by predicting previous versions of its outputs, without leveraging negative examples. As shown in Figure 1, DGB uses two neural networks, an online network and a target network. The online network is trained to predict the target network output, and the target network is updated by an exponential moving average of the online network Lillicrap et al. (2015). Specifically, the DGB model consists of the following components:
Graph convolution network as graph encoders to get node representations.
Data augmentations for graph-structured data including node augmentation and graph diffusion.
A bootstrapping mechanism to make the online network and target network learn from each other.
4.1 Graph Neural Network Encoder
GNNs learn representations through an iterative process of transforming and aggregating information from topological neighbors. In this paper, for simplicity and generality, we opt for the commonly used graph convolution network (GCN) Kipf and Welling (2016) as our encoder:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I_N$ is the adjacency matrix of the input graph with added self-connections, $I_N$ is the identity matrix and $\tilde{D}$ is the degree matrix of $\tilde{A}$, $H^{(l)}$ is the representation in layer $l$, $\sigma$ is a non-linear activation function, and $W^{(l)}$ is a trainable weight matrix, which is what we ultimately aim to learn.
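To make the propagation rule of Kipf and Welling (2016) concrete, the following is a minimal NumPy sketch of a single GCN layer on a dense adjacency matrix. The function name `gcn_layer`, the `tanh` activation and the toy three-node graph are assumptions for illustration only; the text does not specify the activation, and a real implementation would use sparse matrices and trainable weights.

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One propagation step: activation(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])           # add self-connections
    deg = A_tilde.sum(axis=1)                  # degrees of A_tilde (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return activation(A_hat @ H @ W)

# Toy path graph with three nodes and one-hot node features.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.eye(3)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))                    # random stand-in for trained weights
H1 = gcn_layer(A, X, W)                        # node embeddings of shape (3, 2)
```

Adding the identity before normalizing guarantees that every node has degree at least one, so the inverse square root is always defined.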
4.2 Graph-structured data augmentations
Recent successful self-supervised learning approaches in the visual domain learn representations by contrasting congruent and incongruent augmentations of images Grill et al. (2020). Nevertheless, unlike images, which have standard augmentation methods, there is no consensus on effective augmentation methods for graph-structured data. Since a graph carries two kinds of information, node information and adjacency information, we introduce three kinds of graph data augmentations in this paper: node augmentation, adjacency matrix augmentation via graph diffusion networks, and the combination of the two.
4.2.1 Node Augmentation
For the node information, consider the feature matrix $X \in \mathbb{R}^{N \times F}$, where $N$ is the number of nodes and $F$ is the dimensionality of the features. Following Feng (2020), we use node dropout (ND) and node feature dropout (NFD): node dropout randomly zeroes a node's entire feature vector with a pre-defined probability, i.e. it randomly drops row vectors of $X$, while node feature dropout randomly discards individual elements of $X$. The specific formulations of ND and NFD are
$$\tilde{X}_{i} = \frac{\epsilon_{i}}{1 - \delta_{1}} X_{i}, \qquad \tilde{X}_{ij} = \frac{\epsilon_{ij}}{1 - \delta_{2}} X_{ij}$$
where $\delta_1$ and $\delta_2$ are the dropout probabilities of ND and NFD respectively, and $\epsilon_i$ and $\epsilon_{ij}$ are drawn from Bernoulli($1 - \delta_1$) and Bernoulli($1 - \delta_2$) respectively. The factors $1/(1 - \delta_1)$ and $1/(1 - \delta_2)$ make the perturbed feature matrix equal to $X$ in expectation. Note that NFD and ND are only used during training. After training, we use the initial node features to compute representations.
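The two node augmentations can be sketched in a few lines of NumPy. The function names and the dropout probability 0.3 are illustrative assumptions; the only properties taken from the text are that ND masks whole rows, NFD masks individual entries, and both rescale by one over the keep probability so the result matches the input in expectation.

```python
import numpy as np

def node_dropout(X, p, rng):
    """Zero whole rows of X with probability p; rescale kept rows by 1/(1-p)."""
    keep = rng.random(X.shape[0]) >= p         # one Bernoulli(1-p) draw per node
    return X * keep[:, None] / (1.0 - p)

def node_feature_dropout(X, p, rng):
    """Zero single entries of X with probability p; rescale by 1/(1-p)."""
    keep = rng.random(X.shape) >= p            # one Bernoulli(1-p) draw per entry
    return X * keep / (1.0 - p)

rng = np.random.default_rng(0)
X = np.ones((1000, 8))                         # toy feature matrix
X_nd = node_dropout(X, p=0.3, rng=rng)
X_nfd = node_feature_dropout(X, p=0.3, rng=rng)
```

Because the masks are redrawn on every call, each training epoch sees a different perturbed feature matrix, while evaluation simply skips both functions.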
4.2.2 Adjacent Matrix Augmentation
For adjacency matrix augmentation, we consider graph diffusion networks Klicpera et al. (2019). Previous GNNs have two problems: first, each GNN layer limits message passing to one-hop neighbors, which is not reasonable in the real world; second, edges are often noisy or defined with an arbitrary threshold Tang et al. (2018). Graph diffusion networks (GDN) Klicpera et al. (2019) are proposed to tackle these two problems. GDN combines spatial message passing with a sparsified form of graph diffusion, which can be regarded as an equivalent polynomial filter. GDC generates a new graph by sparsifying a generalized form of graph diffusion; as a result, GDC can aggregate information from a larger neighborhood rather than only from first-hop neighbors.
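For a graph $G$, the generalized graph diffusion is formulated as:
$$S = \sum_{k=0}^{\infty} \theta_k T^k \tag{4}$$
where $T \in \mathbb{R}^{N \times N}$ denotes the generalized transition matrix and $\theta_k$ is the weighting coefficient, which determines the ratio of global-local information. In order to guarantee convergence, two conditions are imposed: $\sum_{k=0}^{\infty} \theta_k = 1$ with $\theta_k \in [0, 1]$; and $\lambda_i \in [0, 1]$, where $\lambda_i$ are the eigenvalues of $T$.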
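Two popular instantiations of graph diffusion are personalized PageRank (PPR) Page et al. (1999) and the heat kernel Kondor and Lafferty (2002). For an adjacency matrix $A$ and a degree matrix $D$, $T$ in equation 4 is defined as $T = AD^{-1}$, with $\theta_k = \alpha(1 - \alpha)^k$ for PPR and $\theta_k = e^{-t} t^k / k!$ for the heat kernel, where $\alpha$ is the teleport probability and $t$ is the diffusion time. The specific formulations are shown in equation 5 and equation 6 Hassani and Khasahmadi (2020):
$$S^{\mathrm{PPR}} = \alpha \left(I_N - (1 - \alpha) T\right)^{-1}$$
$$S^{\mathrm{heat}} = \exp\left(tT - t\right)$$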
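The two diffusion instantiations can be checked numerically on a small graph. The sketch below is an assumption-laden illustration rather than the authors' implementation: it uses a dense matrix inverse for PPR and a truncated power series for the heat kernel, both written in terms of $T = AD^{-1}$, and it omits the sparsification step of GDC. The function names, the toy graph and the values of the teleport probability and diffusion time are hypothetical.

```python
import numpy as np

def transition_matrix(A):
    """Column-stochastic T = A D^{-1}; assumes every node has degree >= 1."""
    return A / A.sum(axis=0, keepdims=True)

def ppr_diffusion(A, alpha):
    """Closed form of the series sum_k alpha (1 - alpha)^k T^k."""
    n = A.shape[0]
    T = transition_matrix(A)
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * T)

def heat_diffusion(A, t, n_terms=50):
    """Truncated series sum_k e^{-t} t^k / k! T^k."""
    T = transition_matrix(A)
    S = np.zeros_like(T)
    term = np.exp(-t) * np.eye(A.shape[0])     # k = 0 term
    for k in range(n_terms):
        S += term
        term = term @ T * (t / (k + 1))        # next term: multiply by t T / (k + 1)
    return S

# Toy undirected graph with four nodes.
A = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
S_ppr = ppr_diffusion(A, alpha=0.2)
S_heat = heat_diffusion(A, t=5.0)
```

Since the coefficients sum to one and $T$ is column-stochastic, every column of either diffusion matrix also sums to one, which is a convenient sanity check.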
In the "augmentation" part of Figure 1, we apply NFD to the first view and NFD + graph diffusion to the second view. Comparing them with the initial graph makes it easy to see how these augmentation methods work. For simplicity, we do not scale the augmented node features in Figure 1. We apply different combinations of the augmentation methods above to DGB; the results and discussion are given in Section 5.
4.3 Bootstrapping Process
The bootstrapping process is the core of the DGB model. As shown in Figure 1, the online network consists of three components: one graph convolution network layer $f_\theta$, one multilayer perceptron for projection $g_\theta$, and one multilayer perceptron for prediction $q_\theta$; similarly, the target network has the corresponding $f_\xi$ and $g_\xi$. For simplicity, in the following we use $\theta$ to denote the parameters of $f_\theta$, $g_\theta$ and $q_\theta$, and use $\xi$ to denote the parameters of $f_\xi$ and $g_\xi$. Using a multilayer perceptron for projection has been shown to improve performance Chen et al. (2020), and we discuss it further in the ablation study.
During each training epoch, the parameters of the target network are fixed, and the regression loss between the online network outputs (the predictions $q_\theta(z_\theta)$) and the target network outputs (the projections $z'_\xi$) is used to update the online network parameters. After each training epoch, the target network parameters are updated using an exponential moving average of the online network parameters Lillicrap et al. (2015); Guo et al. (2020). Specifically, for a given target decay rate $\tau \in [0, 1]$, this paper uses the following update:
$$\xi \leftarrow \tau \xi + (1 - \tau)\theta$$
When $\tau = 1$, the target network is never updated and remains at its initial value; when $\tau = 0$, the target network is set to the online network at each step. Therefore, we need a trade-off value of $\tau$ to update the target network at a proper speed.
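The behaviour at the two extremes of the decay rate is easy to see in a small sketch. Here the parameters are stored in plain dictionaries of NumPy arrays, and the function name `ema_update` and the intermediate value 0.99 are assumptions chosen for illustration, not settings reported in the text.

```python
import numpy as np

def ema_update(target_params, online_params, tau):
    """Return tau * target + (1 - tau) * online for every named parameter."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

online = {"W": np.array([1.0, 2.0])}
target = {"W": np.array([0.0, 0.0])}

frozen = ema_update(target, online, tau=1.0)   # target is left unchanged
copied = ema_update(target, online, tau=0.0)   # target becomes a copy of online
slow = ema_update(target, online, tau=0.99)    # target moves 1% of the way
```

Only the parameters shared by both networks (encoder and projector) would be passed to such a function, since the target network has no predictor.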
4.4 Training Method
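As shown in Figure 1, the loss function in DGB is the mean squared error between the online network predictions $q_\theta(z_\theta)$ and the target network projections $z'_\xi$. Before calculating the error, we first $\ell_2$-normalize the predictions and projections:
$$\bar{q}_\theta(z_\theta) = \frac{q_\theta(z_\theta)}{\lVert q_\theta(z_\theta) \rVert_2}, \qquad \bar{z}'_\xi = \frac{z'_\xi}{\lVert z'_\xi \rVert_2}$$
The loss function is:
$$\mathcal{L}_{\theta,\xi} = \left\lVert \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \right\rVert_2^2 = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta), z'_\xi \rangle}{\lVert q_\theta(z_\theta) \rVert_2 \cdot \lVert z'_\xi \rVert_2}$$
After obtaining $\mathcal{L}_{\theta,\xi}$, we separately feed the second-view augmentation to the online network and the first-view augmentation to the target network to compute $\tilde{\mathcal{L}}_{\theta,\xi}$. During each training epoch, $\mathcal{L}_{\theta,\xi} + \tilde{\mathcal{L}}_{\theta,\xi}$ is minimized via stochastic optimization with respect to the parameters of $f_\theta$, $g_\theta$ and $q_\theta$.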
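The normalized regression loss and its symmetrized form can be sketched as follows. This is a forward-pass illustration in NumPy under stated assumptions: the function names are hypothetical, rows are node embeddings, the per-node errors are averaged over nodes, and the stop-gradient on the target outputs (which a real autograd implementation would need) is not represented.

```python
import numpy as np

def l2_normalize(Z, eps=1e-12):
    """Scale each row of Z to unit Euclidean norm."""
    return Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), eps)

def regression_loss(pred, target):
    """Mean over nodes of the squared distance between normalized rows."""
    diff = l2_normalize(pred) - l2_normalize(target)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def symmetric_loss(pred_1, proj_2, pred_2, proj_1):
    """Loss for (online view 1, target view 2) plus the swapped-view term."""
    return regression_loss(pred_1, proj_2) + regression_loss(pred_2, proj_1)

# Toy predictions and projections for two nodes.
a = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([[0.0, 1.0], [5.0, 0.0]])
loss_same = regression_loss(a, 3.0 * a)        # parallel rows
loss_opposite = regression_loss(a, -a)         # anti-parallel rows
loss_orthogonal = regression_loss(a, b)        # orthogonal rows
```

The per-node loss equals two minus twice the cosine similarity, so it ranges from 0 for parallel vectors to 4 for anti-parallel ones and ignores the scale of either input.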
After training, we only keep the GNN layer of the online network, $f_\theta$, and use $f_\theta$ to compute the representations for downstream tasks.
In this section, we present experiments that demonstrate the effectiveness of the DGB model. The datasets in our experiments are three standard benchmark citation network datasets, namely Cora, Citeseer, and Pubmed Sen et al. (2008). In the three datasets, nodes represent documents, edges correspond to citations, and each node has a feature vector corresponding to its bag-of-words representation. Each node belongs to one of several classes. More details about the three datasets are given in Table 1. We use the same data splits as the experimental setup of Kipf and Welling (2016).
5.2 Evaluation Protocol
The aim of the DGB model is to learn a low dimensional representation of each node based on the node features and the interactions between nodes. After the training process of the DGB model, we use a linear evaluation to demonstrate the effectiveness of the learned representations. Following DGI Velickovic et al. (2019), we report the mean classification accuracy on the test nodes after 50 runs of training followed by a linear model.
To comprehensively evaluate the proposed DGB model, we compare it with six supervised methods and eight unsupervised methods in Table 2.
LP Zhu et al. (2003) is based on a Gaussian random field model.
PLANETOID Yang et al. (2016) proposes joint training of classification and graph context prediction.
CHEBYSHEV Defferrard et al. (2016) uses the methods of graph signal processing.
GCN Kipf and Welling (2016) generalizes traditional convolutional networks to graph-structured data.
GAT Veličković et al. (2017) introduces an attention mechanism into GCN.
DEEPWALK Perozzi et al. (2014) learns representations via random walks on graphs.
EP-B Duran and Niepert (2017) learns label and node representations by exchanging messages between nodes.
GMI Peng et al. (2020) maximizes the mutual information between node features and topological structure.
DGI Velickovic et al. (2019) maximizes mutual information at graph/patch-level.
GMNN Qu et al. (2019) combines the advantages of statistical relational learning and graph neural networks to learn node representations.
The results of LP and DEEPWALK are taken from Kipf and Welling (2016), and the results of RAW FEATURES and CHEBYSHEV are taken from Peng et al. (2020) and Hassani and Khasahmadi (2020) respectively. The results of other methods are taken from their original papers.
5.4 Experiments Setup
All experiments are implemented in PyTorch Ketkar (2017) and conducted on a single GeForce RTX 2080Ti with 11GB of memory. We use Glorot initialization Glorot and Bengio (2010) to initialize the model parameters. We perform row normalization on the three datasets as preprocessing. For the graph convolution encoder, we use one layer on all three datasets. During training, we use the Adam optimizer Kingma and Ba (2014) with an initial learning rate of 0.001 for all three datasets.
In the DGB model, data augmentation plays an important role. Consequently, we conduct extensive experiments on different data augmentation combinations in Table 3. After training, we feed the initial node features and the adjacency matrix to the GCN encoder to obtain node representations for the classification task.
|CHEBYSHEV||X, A, Y||81.2||69.8||74.4|
|GCN||X, A, Y||81.5||70.3||79.0|
|GAT||X, A, Y||83.0 ± 0.7||72.5 ± 0.7||79.0 ± 0.3|
|GWNN||X, A, Y||82.8||71.7||79.1|
|GMNN(with )||X, A||78.1||68.0||79.3|
|GMNN(with and )||X, A||82.8||71.5||81.6|
5.5 Experiment Analysis
From Table 2, it is apparent that DGB performs the best among recent state-of-the-art methods. For example, DGB improves on GMI-adaptive by a margin of 0.4% on Cora, and on GMI-mean by a margin of 0.9% on Citeseer.
We attribute this improvement to two points: first, the DGB model learns from its previous version and does not need well-designed negative examples; second, the node augmentation methods generate different node feature matrices in different epochs, which alleviates overfitting to some extent and improves the robustness of DGB.
|Group||First view||Second view||Cora||Citeseer||Pubmed|
|NO||IN + ADJ||IN + ADJ||59.2||51.4||62.2|
|NODE + ADJ||IN + ADJ||NFD + DIFF||80.8||71.1||79.8|
|NODE + ADJ||IN + ADJ||ND + DIFF||79.1||65.9||78.5|
|NODE + ADJ||NFD + ADJ||IN + DIFF||82.1||71.4||78.7|
|NODE + ADJ||ND + ADJ||IN + DIFF||77.9||65.4||78.4|
|NODE + ADJ||ND + ADJ||ND + DIFF||79.0||65.5||77.6|
|NODE + ADJ||NFD + ADJ||NFD + DIFF||81.4||72.3||79.0|
|NODE + ADJ||ND + ADJ||NFD + DIFF||82.3||73.3||79.4|
|NODE + ADJ||NFD + ADJ||ND + DIFF|| || || |
5.6 Results under different graph data augmentations
To demonstrate the relationship between different graph data augmentations and the performance of the DGB model, we combine different augmentations in DGB and show the results in Table 3. In the node augmentation experiments, we only use the adjacency matrix without graph diffusion, while in the adjacency augmentation experiments, we use the initial node features without node dropout or node feature dropout. For fairness, all experiments on a given dataset in Table 3 use the same hyperparameters except for the dropout rate. We select the best dropout rate for node dropout and node feature dropout from 0.1 to 0.9 in steps of 0.1. After training, we feed the feature matrix and adjacency matrix without augmentation to obtain representations for node classification.
The results on the three datasets show that node augmentation and adjacent augmentation can both improve the performances, and combining the two augmentations can achieve higher performances overall.
For node augmentation, NFD clearly improves performance more than ND. We suspect that randomly dropping a node's entire feature vector loses too much information and makes the node difficult to predict from the other view.
There is a clear trend: more combinations of data augmentations bring better performances. For node augmentation group, the combination of NFD & ND can beat the other combinations on the three datasets. Similarly, for node + adj augmentation group, the combination of NFD + ADJ & ND + DIFF and ND + ADJ & NFD + DIFF are better than other combinations.
Since the graph diffusion brings more connections to the nodes, the combination of ND + DIFF does not hurt the graph information too much compared with ND + ADJ. As a result, the combination of NFD + ADJ & ND + DIFF is superior to the combination ND + ADJ & NFD + DIFF.
Although node augmentation and adjacency augmentation both benefit the model's performance, node augmentation is superior to adjacency augmentation. The results of the NODE augmentation group are much better than those of the ADJ augmentation group on all three datasets. This may be because node augmentation generates a different node feature matrix in each epoch, owing to the randomness of the Bernoulli distribution, while the diffusion matrix is the same across epochs. The randomness improves the performance and robustness of the DGB model.
5.7 Ablation study
In this section, we examine the role of the multilayer perceptron for projection in our DGB model and how the bootstrapping mechanism helps the DGB model. We conduct experiments on the three datasets with and without a multilayer perceptron for projection. To analyze the bootstrapping mechanism, we separately set the target decay rate $\tau$ to 0 and to 1. When $\tau$=0, the target network copies the parameters of the online network after each training epoch, whereas when $\tau$=1 the target network parameters remain constant.
As shown in Table 4, the performance of the DGB model with and without the MLP differs substantially on the three datasets. Both keeping the target network parameters constant and replacing them entirely with the online network parameters hurt performance.
We also visualize the node representations of the online network using t-SNE Maaten and Hinton (2008) on the Cora dataset in Figure 2. Figure 2 contains four subgraphs: the representations of the DGB model, of the DGB model without a projection layer, and of the DGB model with $\tau$=0 and with $\tau$=1.
Comparing the four subgraphs, we can see that without the projection layer the representations of different classes become tighter and are not easy to distinguish. When the target decay rate $\tau$=0, the representations are better than those with $\tau$=1, but both are worse than the representations obtained with the slow-moving average mechanism. Neither leaving the target network parameters unchanged nor replacing them entirely is the best option, and there is a trade-off between the two choices.
In this paper, we introduce DGB, a new self-supervised graph representation learning method. DGB relies on two neural networks, an online network and a target network, and the input of each network is an augmentation of the initial graph. With the help of the bootstrapping process, the online network and target network can learn from each other. As a result, DGB does not need negative examples and can learn in an unsupervised manner. Experiments on three benchmark datasets show that DGB is superior to state-of-the-art methods. In addition, we systematically summarize different graph data augmentation methods: node augmentation, adjacency matrix augmentation and the combination of the two. We also apply different data augmentation types to DGB, and the experimental results show how different augmentation methods affect its performance. As we only apply DGB to homogeneous networks, we will focus on generalizing the DGB model to heterogeneous information networks in the future.
- Caron et al. (2018). Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
- Chen et al. (2020). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
- Defferrard et al. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
- Duran and Niepert (2017). Learning graph representations with embedding propagation. In Advances in Neural Information Processing Systems, pp. 5119–5130.
- Duvenaud et al. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232.
- Feng (2020). Graph random neural network. arXiv preprint arXiv:2005.11079.
- Gilmer et al. (2017). Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.
- Glorot and Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- Grill et al. (2020). Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
- Grover and Leskovec (2016). Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
- Guo et al. (2020). Bootstrap latent-predictive representations for multitask reinforcement learning. arXiv preprint arXiv:2004.14646.
- Hassani and Khasahmadi (2020). Contrastive multi-view representation learning on graphs. arXiv preprint arXiv:2006.05582.
- He et al. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
- Hjelm et al. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
- Ketkar (2017). Introduction to PyTorch. In Deep Learning with Python, pp. 195–208.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kipf and Welling (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- Klicpera et al. (2019). Diffusion improves graph learning. In Advances in Neural Information Processing Systems, pp. 13354–13366.
- Kondor and Lafferty (2002). Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th International Conference on Machine Learning, Vol. 2002, pp. 315–22.
- Li et al. (2019). Graph matching networks for learning the similarity of graph structured objects. arXiv preprint arXiv:1904.12787.
- Lillicrap et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- Page et al. (1999). The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab.
- Peng et al. (2020). Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, pp. 259–270.
- Perozzi et al. (2014). DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710.
- Qu et al. (2019). GMNN: graph Markov neural networks. arXiv preprint arXiv:1905.06214.
- Sen et al. (2008). Collective classification in network data. AI Magazine 29 (3), pp. 93–93.
- Sun et al. (2019). InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000.
- Tang et al. (2015). LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
- Tang et al. (2018). An atomistic fingerprint algorithm for learning ab initio molecular force fields. The Journal of Chemical Physics 148 (3), pp. 034101.
- Van Hasselt et al. (2018). Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648.
- Veličković et al. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
- Velickovic et al. (2019). Deep graph infomax. In ICLR (Poster).
- Vivona and Hassani (2019). Relational graph representation learning for open-domain question answering. arXiv preprint arXiv:1910.08249.
- Wang et al. (2019). Neural graph evolution: towards efficient automatic robot design. arXiv preprint arXiv:1906.05370.
- Xu et al. (2019). Graph wavelet neural network. arXiv preprint arXiv:1904.07785.
- Yang et al. (2016). Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, pp. 40–48.
- Zhu et al. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 912–919.