
Self-supervised Contrastive Attributed Graph Clustering

by   Wei Xia, et al.
Xidian University

Attributed graph clustering, which learns node representations from node attributes and the topological graph for clustering, is a fundamental but challenging task for graph analysis. Recently, methods based on graph contrastive learning (GCL) have obtained impressive clustering performance on this task. Yet, we observe that existing GCL-based methods 1) fail to benefit from imprecise clustering labels; 2) require a post-processing operation to get clustering labels; 3) cannot solve the out-of-sample (OOS) problem. To address these issues, we propose a novel attributed graph clustering network, namely Self-supervised Contrastive Attributed Graph Clustering (SCAGC). In SCAGC, by leveraging inaccurate clustering labels, a self-supervised contrastive loss, which aims to maximize the similarities of intra-cluster nodes while minimizing the similarities of inter-cluster nodes, is designed for node representation learning. Meanwhile, a clustering module is built to directly output clustering labels by contrasting the representations of different clusters. Thus, for the OOS nodes, SCAGC can directly calculate their clustering labels. Extensive experimental results on four benchmark datasets show that SCAGC consistently outperforms 11 competitive clustering methods.



1 Introduction

In the era of the Internet, network-structured data has penetrated into every corner of life. Representative examples include shopping networks Shchur et al. (2018), social networks Piao et al. (2021), recommendation systems Huang et al. (2021), and citation networks Wan et al. (2021). Real-world scenarios such as these can be modeled as attributed graphs, i.e., topological graph structures with node attributes (or features). Due to the non-Euclidean topological graph structure and complex node attributes, most existing machine learning approaches cannot be directly applied to analyze such data. To this end, graph neural networks (GNNs) Kipf and Welling (2017) have emerged and developed rapidly in recent years. A GNN aims to learn low-dimensional node representations for downstream tasks by simultaneously encoding the topological graph and the node attributes. In this article, we study the attributed graph clustering problem, which is one of the most challenging tasks in the field of AI.

Figure 1: The framework of the proposed Self-supervised Contrastive Attributed Graph Clustering (SCAGC).

Attributed graph clustering, i.e., node clustering, aims to divide massive nodes into several disjoint clusters without intense manual guidance. To date, numerous attributed graph clustering methods have been proposed Wang et al. (2017); Zhang et al. (2019); Park et al. (2019); Cheng et al. (2020); Fan et al. (2020); Xia et al. (2021); Lin and Kang (2021), most of which are based on the graph auto-encoder (GAE) and variational GAE (VGAE) Kipf and Welling (2016). For example, to learn a robust node representation, Pan et al. (2018, 2020) proposed two variants of GAE and VGAE, namely the adversarially regularized graph auto-encoder (ARGA) and the adversarially regularized variational graph auto-encoder (ARVGA). To build a clustering-directed network, inspired by deep embedding clustering (DEC) Xie et al. (2016), Wang et al. (2019) minimized the mismatch between the clustering distribution and the target distribution to improve the quality of node representation, and proposed the deep attentional embedded graph clustering (DAEGC) approach. Similarly, Bo et al. (2020) presented the structural deep clustering network (SDCN) to embed the topological structure into deep clustering. SDCN uses a traditional auto-encoder to obtain new node features by encoding the node attributes, and then uses a GNN to simultaneously encode the topological structure and the new node features to learn the final node representation for clustering. Tu et al. (2021) proposed the deep fusion clustering network (DFCN), which uses a dynamic cross-modality fusion mechanism to obtain a consensus node representation, thereby generating a more robust target distribution for network optimization. Although the aforementioned methods have made encouraging progress, how to mine the highly heterogeneous information embedded in the attributed graph remains to be explored.

Recently, due to its powerful unsupervised representation learning ability, contrastive learning (CL) has made vast inroads into the computer vision community Chen et al. (2020); He et al. (2020). Motivated by this, several recent studies Velickovic et al. (2019); Sun et al. (2020); Zhang et al. (2021); Qiu et al. (2020); You et al. (2020); Zhu et al. (2021); Jin et al. (2021); Zhao et al. (2021) show promising results on unsupervised graph representation learning (GRL) using approaches related to CL. We call this kind of methods graph contrastive representation learning methods (GCRL for short in this paper). For example, Velickovic et al. (2019) proposed deep graph information maximization (DGI) to learn node representations by contrasting the local node-level representation and the global graph-level representation. Similarly, Sun et al. (2020) proposed to learn graph-level representations by maximizing the mutual information between the graph-level representation and the representations of substructures. Based on the contrastive loss in SimCLR Chen et al. (2020), You et al. (2020) proposed a new graph contrastive learning network with several kinds of graph augmentation approaches (GraphCL) to facilitate node representation learning. More recently, Zhu et al. (2021) first used adaptive graph augmentation schemes to construct different graph views, and then extracted node representations by maximizing the agreement of node representations between graph views.

Although driven by various motivations and achieving commendable results, many existing GCRL methods still have the following challenging issues:

  1. They are task-agnostic and thus need a post-processing step to get clustering labels, resulting in suboptimal node representations for the downstream node clustering task.

  2. They fail to benefit from imprecise clustering labels, thus suffering from inferior performance.

  3. They cannot handle out-of-sample (OOS) nodes, which limits their application in practical engineering.

As shown in Figure 1, we propose the self-supervised contrastive attributed graph clustering (SCAGC), a new attributed graph clustering approach that aims to address the aforementioned limitations. In SCAGC, we first leverage graph augmentation methods to generate abundant attributed graph views. Each augmented attributed graph then has two compact representations: a clustering assignment probability produced by the clustering module and a low-dimensional node representation produced by the graph representation learning module. The two representations interact with each other and jointly evolve in an end-to-end framework. Specifically, the clustering module is trained via a contrastive clustering loss to maximize the agreement between representations of the same cluster. The graph representation learning module is trained using the proposed self-supervised contrastive loss on pseudo labels, i.e., clustering labels, where nodes within the same cluster are trained to have similar representations. We perform experiments on four attributed graph datasets and compare with 11 state-of-the-art GRL and GCRL methods. The proposed SCAGC outperforms all baselines on all four benchmarks. The main contribution of the proposed SCAGC is two-fold:

  1. To the best of our knowledge, SCAGC could be the first contrastive attributed graph clustering work without post-processing. SCAGC can directly predict the clustering assignment result of given unlabeled attributed graph. For OOS nodes, SCAGC can also directly calculate the clustering labels without retraining the entire attributed graph, which accelerates the implementation of SCAGC in practical engineering.

  2. By benefiting from the clustering labels, we propose a new self-supervised CL loss, which facilitates graph representation learning. Extensive experimental results demonstrate its effectiveness for attributed graph clustering.

2 Methodology

In this section, we first formalize the node clustering task on attributed graphs. Then, the overall framework of the proposed SCAGC will be introduced. Finally, we detail each component of the proposed network.

2.1 Problem Formalization

Given an arbitrary attributed graph $\mathcal{G} = (\mathcal{V}, E, X)$, where $\mathcal{V} = \{v_1, v_2, \ldots, v_N\}$ is the vertex set, E is the edge set, $X \in \mathbb{R}^{N \times d}$ is the node attribute matrix, N is the number of nodes, and d is the dimension of the node attributes. $G \in \mathbb{R}^{N \times N}$ is the adjacency matrix of $\mathcal{G}$, and $G_{ij} = 1$ iff $(v_i, v_j) \in E$, i.e., there is an edge from node $v_i$ to $v_j$.

In this article, we study one of the most representative downstream tasks of GNNs, i.e., node clustering. The target of node clustering is to divide the given N unlabeled nodes into K disjoint clusters $\{C_1, C_2, \ldots, C_K\}$, such that nodes in the same cluster have high similarity to each other Cui et al. (2020); Xia et al. (2021).

2.2 Overall Network Architecture

As shown in Figure 1, the network architecture of the proposed SCAGC consists of the following joint optimization components: shared graph convolutional encoder, contrastive clustering module and self-supervised graph contrastive representation learning module.

  • Shared Graph Convolutional Encoder: It aims to simultaneously map the augmented node attribute and topological graph structure to a new low-dimensional space for downstream node clustering task.

  • Self-Supervised GCRL Module: To learn more discriminative graph representation and utilize the useful information embedded in inaccurate clustering labels, this module is designed to maximize the similarities of intra-cluster nodes, i.e., positive pairs, while minimizing the similarities of inter-cluster nodes, i.e., negative pairs.

  • Contrastive Clustering Module: To directly get clustering labels, this module builds a clustering network by contrasting the representation of different clusters.

2.3 Shared Graph Convolutional Encoder

Graph contrastive representation learning has attracted much attention due to its ability to utilize graph augmentation schemes to generate positive and negative node pairs for representation learning You et al. (2020); Zhu et al. (2021). Specifically, given an arbitrary attributed graph with node attribute X and topological graph G, two stochastic graph augmentation schemes $\mathcal{A}^{(1)} \sim \mathcal{T}$ and $\mathcal{A}^{(2)} \sim \mathcal{T}$ are leveraged to construct two correlated attributed graph views $\{X^{(1)}, G^{(1)}\}$ and $\{X^{(2)}, G^{(2)}\}$, where $X^{(v)} = \mathcal{A}^{(v)}(X)$ and $G^{(v)} = \mathcal{A}^{(v)}(G)$, $\mathcal{A}^{(v)}$ is the v-th graph augmentation, and $\mathcal{T}$ denotes the set of all kinds of graph augmentation methods, including attribute masking and edge perturbation. To be specific, attribute masking randomly adds noise to node attributes, and edge perturbation randomly adds or drops edges in the topological graph. The underlying prior of these two graph augmentation schemes is to keep the intrinsic topological structure and node attributes of the attributed graph unchanged. Based on this prior, the learned node representation will be robust to perturbations on insignificant attributes and edges. In this article, we implement the graph augmentations following the setting in GCA Zhu et al. (2021).
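The two augmentation families named above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions: masking is done by zeroing whole attribute dimensions, edges are flipped uniformly at random, and the names `mask_attributes`, `perturb_edges`, `p_mask` and `p_flip` are hypothetical. The adaptive, centrality-weighted scheme of GCA that the paper follows is not reproduced here.

```python
import numpy as np

def mask_attributes(X, p_mask, rng):
    """Zero out each attribute dimension (column) with probability p_mask."""
    keep = rng.random(X.shape[1]) >= p_mask           # one decision per column
    return X * keep[np.newaxis, :]

def perturb_edges(G, p_flip, rng):
    """Flip each undirected edge slot (add or drop) with probability p_flip."""
    n = G.shape[0]
    flip = np.triu(rng.random((n, n)) < p_flip, k=1)  # upper triangle, no diagonal
    flip = flip | flip.T                               # keep the graph symmetric
    return np.where(flip, 1 - G, G)

rng = np.random.default_rng(0)
X = rng.random((5, 4))                     # toy attribute matrix, N=5, d=4
G = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]])            # toy symmetric adjacency (a 5-cycle)

# Two independently sampled views of the same attributed graph.
X1, G1 = mask_attributes(X, 0.3, rng), perturb_edges(G, 0.1, rng)
X2, G2 = mask_attributes(X, 0.3, rng), perturb_edges(G, 0.1, rng)
```

Both views keep the shape of the original graph, so the same encoder weights can be applied to each of them.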

After obtaining two augmented attributed graph views $\{X^{(1)}, G^{(1)}\}$ and $\{X^{(2)}, G^{(2)}\}$, we utilize a shared two-layer graph convolutional network to simultaneously encode the node attributes and topological graphs of the augmented attributed graph views. Thus, we have


where $Z^{(v)}_{(1)}$ is the first layer's output of the shared GNN; $Z^{(v)}$ is the node representation under the v-th graph augmentation; $\Theta$ denotes the trainable parameters of the graph convolutional encoder; $\tilde{G}^{(v)} = G^{(v)} + I$; $\tilde{D}^{(v)}_{ii} = \sum_j \tilde{G}^{(v)}_{ij}$; I is an identity matrix; and $\sigma(\cdot)$ represents the nonlinear ReLU activation function.

So far, we have obtained the node representations $Z^{(1)}$ and $Z^{(2)}$ of the two augmented attributed graph views.
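A shared two-layer graph convolutional encoder of this kind can be sketched in NumPy. The sketch assumes the standard propagation rule of Kipf and Welling (2017), i.e., multiplication by $\tilde{D}^{-1/2}(G + I)\tilde{D}^{-1/2}$ in each layer, and applies ReLU after both layers, which is an assumption rather than something the text confirms. The function names, weight names and toy sizes are hypothetical.

```python
import numpy as np

def normalize_adjacency(G):
    """Return D~^{-1/2} (G + I) D~^{-1/2}, the renormalized adjacency used by GCN."""
    G_tilde = G + np.eye(G.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(G_tilde.sum(axis=1))  # degrees are >= 1
    return G_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encoder(X, G, W1, W2):
    """Two-layer GCN forward pass; ReLU on both layers is an assumption."""
    A_hat = normalize_adjacency(G)
    Z1 = np.maximum(A_hat @ X @ W1, 0.0)             # first-layer output
    return np.maximum(A_hat @ Z1 @ W2, 0.0)          # node representation

rng = np.random.default_rng(0)
N, d, hidden, out = 6, 8, 16, 4                      # toy sizes, not the paper's
X = rng.random((N, d))
G = np.triu((rng.random((N, N)) < 0.4).astype(float), k=1)
G = G + G.T                                          # symmetric, no self-loops
W1 = rng.normal(scale=0.1, size=(d, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, out))

Z = gcn_encoder(X, G, W1, W2)
print(Z.shape)   # (6, 4)
```

Calling `gcn_encoder` on each augmented view with the same `W1` and `W2` is what makes the encoder shared across views.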

Figure 2: The illustration of self-supervised CL. One node is taken as the anchor, and nodes in the same cluster have the same color. In (a), traditional CL mistakenly regards the remaining four positive nodes (purple nodes) in the two augmented views as negative nodes of the anchor.

2.4 Self-Supervised GCRL Module

In the field of GRL, contrastive learning based GRL has been an effective paradigm that maximizes the similarities of positive pairs while minimizing the similarities of negative pairs to learn discriminative graph representations. For a given attributed graph with N nodes, there are 2N augmented nodes. Traditional CL regards the representations of a node under two different augmentations as a positive pair, and treats the pairs formed with the other 2N-2 augmented nodes as negative (see Figure 2 (a)). Despite its promising performance, this assumption runs counter to the criterion of clustering. In node clustering, we hope that nodes in the same cluster have high similarity to each other while nodes in different clusters have low similarity to each other. However, existing methods fail to consider this criterion, i.e., they neglect the existence of false-negative pairs.
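The false-negative problem can be made concrete with a small count. The sketch below uses a hypothetical labeling of N=5 nodes (three in one cluster, two in the other) and compares the positives of traditional CL with the positives implied by cluster membership.

```python
import numpy as np

q = np.array([0, 0, 0, 1, 1])            # hypothetical cluster labels of N=5 nodes
labels = np.tile(q, 2)                   # 2N augmented nodes (two views)
n = labels.size
not_self = ~np.eye(n, dtype=bool)

# Traditional CL: the only positive of node i is its copy in the other view.
instance_pos = np.roll(np.eye(n, dtype=bool), n // 2, axis=1)
# Cluster-aware view: every other node of the same cluster is a positive.
cluster_pos = (labels[:, None] == labels[None, :]) & not_self

# Same-cluster nodes that traditional CL pushes away as negatives.
false_negatives = cluster_pos & ~instance_pos

print(instance_pos.sum(axis=1)[0], (not_self & ~instance_pos).sum(axis=1)[0])  # 1 8
print(false_negatives.sum(axis=1)[0])                                          # 4
```

For the first node, traditional CL uses 1 positive and 2N-2 = 8 negatives, and 4 of those negatives actually share its cluster.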

In this article, by leveraging the pseudo clustering labels $Q$, we can easily get the sample indices of different clusters. As shown in Figure 2 (b), we aim to maximize the similarities of intra-cluster nodes, i.e., positive pairs, while minimizing the similarities of inter-cluster nodes, i.e., negative pairs. To this end, we first map the node representations $Z^{(1)}$ and $Z^{(2)}$ to enhanced node representations $M^{(1)}$ and $M^{(2)}$ via a shared two-layer fully connected network with parameter $\Psi$, which also helps to form and preserve more information in $Z^{(1)}$ and $Z^{(2)}$, where $M^{(v)} \in \mathbb{R}^{N \times d_M}$ and $d_M$ is the dimension of the new node representation. After that, for the i-th node, we propose a new self-supervised contrastive loss function, which is defined as


where $\tau$ is the temperature parameter, and $m_i$ represents the i-th row of the node representation $M$. $\Delta(i)$ represents the set of nodes that belong to the same cluster as the i-th node, and $|\Delta(i)|$ is its cardinality, which can be obtained from the pseudo clustering assignment matrix $Q$. $A(i)$ is the set of indices of all nodes except the i-th node.
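A loss with these ingredients (a temperature, a positive set made of same-cluster nodes, and a contrast set of all nodes except the anchor) can be sketched in NumPy as below. The sketch follows the usual supervised-contrastive form; the use of cosine similarity, the averaging over anchors, and the name `self_supervised_contrastive_loss` are assumptions, so this should be read as an illustration rather than the paper's exact formula.

```python
import numpy as np

def self_supervised_contrastive_loss(M, labels, tau=0.5):
    """Cluster-guided contrastive loss over the stacked views.

    M      : (2N, d) enhanced representations of both views, stacked row-wise.
    labels : (2N,) pseudo cluster label of each row.
    Positives of row i are all other rows that carry the same pseudo label.
    """
    M = M / np.linalg.norm(M, axis=1, keepdims=True)     # cosine similarity (assumed)
    logits = M @ M.T / tau
    n = M.shape[0]
    not_self = ~np.eye(n, dtype=bool)                    # every index except i
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp_logits = np.exp(logits) * not_self
    log_prob = logits - np.log(exp_logits.sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & not_self
    # Average log-probability over the positive set of each anchor.
    per_node = -(log_prob * positives).sum(axis=1) / positives.sum(axis=1)
    return per_node.mean()

q = np.array([0, 1])                       # pseudo labels of N=2 nodes
labels = np.tile(q, 2)                     # both views of a node share its label
M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])
print(self_supervised_contrastive_loss(M, labels))
```

Because the two views of a node always share a pseudo label, every anchor has at least one positive and the division by the size of the positive set is safe.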

Then, taking all nodes into account, the self-supervised contrastive loss is


2.5 Contrastive Clustering Module

How to obtain the clustering labels is crucial for the downstream clustering task. Most existing methods directly apply classical clustering algorithms, e.g., K-Means or spectral clustering, to the learned node representation to get clustering results. However, such a strategy performs node representation learning and clustering in two separate steps, which limits clustering performance. To this end, we build a clustering network to directly obtain the clustering labels. Specifically, as shown in Figure 1, the clustering network is applied to transform the pattern structures of $Z^{(1)}$ and $Z^{(2)}$ into probability distributions of clustering labels $Y^{(1)}$ and $Y^{(2)}$.

To share the parameters across augmentations, we pass $Z^{(1)}$ and $Z^{(2)}$ through a shared two-layer fully connected network with parameter $\Phi$. Under this setting, we can ensure that $Y^{(1)}$ and $Y^{(2)}$ own the same coding scheme. Thus, $Y^{(1)} \in \mathbb{R}^{N \times K}$ is the output of the clustering network under the first augmented attributed graph view, and $Y^{(2)} \in \mathbb{R}^{N \times K}$ is the output for the second augmented attributed graph view, where K is the number of clusters and $y^{(v)}_{i,k}$ represents the probability of assigning the i-th node to the k-th cluster.

For the obtained assignment matrices $Y^{(1)}$ and $Y^{(2)}$, in the column direction, the k-th column $y^{(v)}_k$ of $Y^{(v)}$ is the representation of the k-th cluster. Thus, we should pull closer the cluster representations of the same class, and push apart the cluster representations of different classes. That is to say, for the k-th cluster in each augmented attributed graph view, there is only one positive pair $(y^{(1)}_k, y^{(2)}_k)$, and 2K-2 negative pairs. To this end, motivated by the great success of contrastive learning Chen et al. (2020), we leverage the contrastive loss function to implement this constraint. Thus, for the k-th cluster in the first augmentation, we have



where $\tau_c$ is a parameter that controls the softness. Given two vectors f and s, $\mathrm{sim}(f, s) = f^{\top} s / (\|f\| \|s\|)$ is the cosine similarity between them. In this article, we use the function $\mathrm{sim}(\cdot, \cdot)$ to measure the similarity of node pairs. Then, taking all positive pairs into account, the contrastive clustering loss is defined as


Moreover, to avoid the trivial solution, i.e., to make sure that all nodes can be evenly assigned to all clusters, similar to Li et al. (2021); Mao et al. (2021), we herein introduce a clustering regularizer $\mathcal{L}_R$, which is defined as


where .
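The cluster-level contrast and the balance regularizer described in this subsection can be sketched as follows. The sketch is written in the spirit of the contrastive clustering formulation of Li et al. (2021), which the text cites; the exact equations of SCAGC may differ. The function names, the averaging over the 2K cluster representations, and the entropy-based form of the regularizer are assumptions.

```python
import numpy as np

def cluster_contrastive_loss(Y1, Y2, tau_c=1.0):
    """Contrast cluster representations, i.e. the columns of Y1 and Y2 (each N x K)."""
    K = Y1.shape[1]
    C = np.concatenate([Y1.T, Y2.T], axis=0)             # (2K, N): one row per cluster
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    sim = C @ C.T / tau_c                                # cosine similarity / softness
    np.fill_diagonal(sim, -np.inf)                       # a cluster is not its own pair
    partner = np.concatenate([np.arange(K, 2 * K), np.arange(K)])  # same cluster, other view
    log_denominator = np.log(np.exp(sim).sum(axis=1))    # 1 positive + (2K-2) negatives
    return (log_denominator - sim[np.arange(2 * K), partner]).mean()

def cluster_regularizer(Y1, Y2):
    """Negative entropy of the cluster-size distribution; lower means more balanced."""
    total = 0.0
    for Y in (Y1, Y2):
        p = Y.sum(axis=0) / Y.sum()                      # share of probability mass per cluster
        total += (p * np.log(p + 1e-12)).sum()
    return total

# Toy soft assignments of N=4 nodes to K=2 clusters in two views.
Y1 = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
Y2 = np.array([[0.7, 0.3], [0.9, 0.1], [0.3, 0.7], [0.2, 0.8]])
print(cluster_contrastive_loss(Y1, Y2), cluster_regularizer(Y1, Y2))
```

Minimizing the first function aligns each cluster column across the two views, while minimizing the second discourages assigning all nodes to a single cluster.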

In the proposed SCAGC training process, when we take the un-augmented attributed graph $\{X, G\}$ as the input of SCAGC, we can get the clustering assignment matrix $Q$ by discretizing the continuous output probability $Y$.
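One common way to discretize a soft assignment is a row-wise argmax, sketched below. Whether SCAGC uses exactly this rule is an assumption, and the name `discretize` is hypothetical.

```python
import numpy as np

def discretize(Y):
    """Turn soft assignments (N x K) into hard labels and a one-hot matrix."""
    labels = Y.argmax(axis=1)                      # most probable cluster per node
    one_hot = np.eye(Y.shape[1], dtype=int)[labels]
    return labels, one_hot

Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.4, 0.5, 0.1]])
labels, one_hot = discretize(Y)
print(labels)   # [0 2 1]
```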

Remark 1

Solving out-of-sample nodes. For OOS nodes $\hat{X}$, SCAGC can directly take $\hat{X}$ as input to calculate the clustering assignment matrix. In contrast, existing GRL and GCRL based methods are inefficient for OOS nodes, since they require retraining on the whole attributed graph.

Input: Attributed graph with node attribute matrix X and adjacency matrix G, cluster number K, hyper-parameters $\tau$, $\tau_c$, $\gamma$, learning rate $\eta$ and maximum number of iterations $T_{max}$.
Output: Clustering label $Q$.
Initialization: initialize the parameters of each component, and the clustering assignment matrix $Q$ by inputting the raw attributed graph $\{X, G\}$;
// Training SCAGC
for T = 1 to $T_{max}$ do
    Sample two stochastic graph augmentation schemes $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$;
    Construct the augmented attributed graph views $\{X^{(1)}, G^{(1)}\}$ and $\{X^{(2)}, G^{(2)}\}$, where $X^{(1)} = \mathcal{A}^{(1)}(X)$, $G^{(1)} = \mathcal{A}^{(1)}(G)$, $X^{(2)} = \mathcal{A}^{(2)}(X)$, and $G^{(2)} = \mathcal{A}^{(2)}(G)$;
    Obtain variables $Z^{(1)}$, $Z^{(2)}$, $M^{(1)}$, $M^{(2)}$, $Y^{(1)}$ and $Y^{(2)}$ by forward propagation;
    Calculate the overall objective with Eq. (8) and pseudo clustering label $Q$;
    Update network parameters via stochastic gradient descent to minimize Eq. (8);
    // Update pseudo clustering label
    if T % 5 == 0 then
        Update the clustering assignment matrix $Q$ by mapping the raw attributed graph $\{X, G\}$;
    end if
end for
// Obtain clustering results
Obtain the clustering assignment matrix $Q$ by mapping the raw attributed graph $\{X, G\}$;
return: Clustering label matrix $Q$.
Algorithm 1 Procedure for training SCAGC

2.6 Optimization

Finally, we integrate the aforementioned three sub-modules into an end-to-end optimization framework. The overall objective function of SCAGC can be formulated as


where $\gamma$ is a trade-off parameter. By optimizing Eq. (8), nodes with correct pseudo labels propagate useful information for graph representation learning, and the learned representation is in turn used to conduct the subsequent clustering. By this strategy, node clustering and graph representation learning are seamlessly connected, with the aim of achieving better clustering results. We employ the Adam optimizer Kingma and Ba (2015) with learning rate $\eta$ to optimize the proposed SCAGC, i.e., Eq. (8). Algorithm 1 presents the pseudo-code for optimizing the proposed SCAGC.

| Dataset | # Nodes | # Attribute dimension | # Edges | # Classes | Type | Scale |
|---|---|---|---|---|---|---|
| ACM Tang et al. (2008) | 3,025 | 1,870 | 29,281 | 3 | Paper relationship | Small |
| DBLP Pan et al. (2016) | 4,057 | 334 | 5,000,495 | 4 | Author relationship | Small |
| Amazon-Photo Shchur et al. (2018) | 7,650 | 745 | 119,081 | 8 | Commodity purchase relationship | Medium |
| Amazon-Computers Shchur et al. (2018) | 13,752 | 767 | 245,861 | 10 | Commodity purchase relationship | Large |

Table 1: Statistics of the real-world evaluation datasets.

3 Experiments

3.1 Experiment Setup

3.1.1 Benchmark Datasets

In this article, we use four real-world attributed graph datasets from different domains, e.g., academic networks and shopping networks, to evaluate the effectiveness of the proposed SCAGC, including ACM, DBLP, Amazon-Photo and Amazon-Computers. Table 1 presents detailed statistics of these datasets.

3.1.2 Baseline Methods

We compare clustering performance of the proposed SCAGC with 11 state-of-the-art node clustering methods, including the following three categories:

  1. Classical clustering methods: K-means, and spectral clustering (SC);

  2. Graph embedding clustering methods: GAE Kipf and Welling (2016), VGAE Kipf and Welling (2016), ARGA Pan et al. (2020), ARVGA Pan et al. (2020), DAEGC Wang et al. (2019), SDCN Bo et al. (2020), and DFCN Tu et al. (2021).

  3. GCRL based methods: GraphCL You et al. (2020) and GCA Zhu et al. (2021).

For the first category, K-means takes raw node attribute as input, and SC takes raw topological graph structure as input. As for the second and third categories, they take raw node attribute and topological graph structure as input. For GAE, VGAE, ARGA, ARVGA, SDCN, DFCN, GraphCL and GCA, the clustering assignment matrix is obtained by running K-means on the extracted node representation.

3.1.3 Evaluation Metrics

Similar to Bo et al. (2020); Tu et al. (2021), we leverage four commonly used metrics to evaluate the clustering quality of all methods, i.e., accuracy (ACC), normalized mutual information (NMI), adjusted Rand index (ARI), and macro F1-score (F1). For these metrics, the higher the value, the better the performance.
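Cluster indices are arbitrary, so ACC is usually computed after matching predicted clusters to ground-truth classes one-to-one. The sketch below does this by brute force over label permutations, which is only practical for small K; evaluation code typically uses the Hungarian algorithm instead. The function name `clustering_accuracy` is hypothetical and this is not the authors' evaluation script.

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(y_true, y_pred):
    """ACC under the best one-to-one relabeling of the predicted clusters."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    counts = np.zeros((k, k), dtype=int)           # counts[p, t]: predicted p, true t
    np.add.at(counts, (y_pred, y_true), 1)
    # Try every assignment of predicted clusters to true classes (small k only).
    best = max(sum(counts[p, perm[p]] for p in range(k))
               for perm in permutations(range(k)))
    return best / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                        # same partition, renamed clusters
print(clustering_accuracy(y_true, y_pred))         # 1.0
```

NMI and ARI are invariant to such renaming by construction, so they need no matching step.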

| Method | ACM ACC | ACM NMI | ACM F1 | ACM ARI | DBLP ACC | DBLP NMI | DBLP F1 | DBLP ARI |
|---|---|---|---|---|---|---|---|---|
| K-Means | 67.26 ± 0.75 | 31.91 ± 0.35 | 54.47 ± 0.32 | 30.76 ± 0.62 | 39.08 ± 0.36 | 10.11 ± 0.21 | 38.01 ± 0.37 | 7.28 ± 0.29 |
| SC | 36.80 ± 0.00 | 0.75 ± 0.00 | 42.63 ± 0.00 | 0.58 ± 0.00 | 29.57 ± 0.01 | 0.08 ± 0.00 | 40.86 ± 0.00 | 0.70 ± 0.00 |
| GAE (NeurIPS'16) | 82.47 ± 0.92 | 50.29 ± 1.86 | 82.65 ± 0.89 | 54.59 ± 1.99 | 59.25 ± 0.40 | 26.37 ± 0.29 | 59.84 ± 0.32 | 20.95 ± 0.43 |
| VGAE (NeurIPS'16) | 82.85 ± 0.63 | 50.22 ± 1.24 | 82.85 ± 0.62 | 55.56 ± 1.15 | 62.22 ± 0.83 | 26.62 ± 1.37 | 60.70 ± 0.85 | 25.08 ± 1.23 |
| ARGA (IEEE TC'20) | 86.85 ± 0.64 | 58.05 ± 1.53 | 86.84 ± 0.60 | 64.77 ± 1.53 | 64.60 ± 0.95 | 28.65 ± 0.63 | 64.49 ± 0.63 | 27.44 ± 1.27 |
| ARVGA (IEEE TC'20) | 84.84 ± 0.36 | 52.89 ± 0.84 | 84.86 ± 0.35 | 59.67 ± 0.85 | 64.10 ± 0.96 | 31.01 ± 0.89 | 64.36 ± 1.01 | 25.69 ± 1.51 |
| DAEGC (IJCAI'19) | 87.18 ± 0.05 | 59.32 ± 0.12 | 87.27 ± 0.05 | 65.46 ± 0.12 | 75.87 ± 0.46 | 42.45 ± 0.58 | 75.41 ± 0.45 | 46.80 ± 0.87 |
| SDCN (WWW'20) | 89.44 ± 0.26 | 65.89 ± 0.95 | 89.40 ± 0.28 | 71.47 ± 0.67 | 71.91 ± 0.57 | 37.80 ± 1.06 | 71.21 ± 0.73 | 40.45 ± 1.18 |
| DFCN (AAAI'21) | 90.15 ± 0.05 | 67.98 ± 0.18 | 90.14 ± 0.05 | 73.25 ± 0.14 | 75.42 ± 0.82 | 43.20 ± 0.74 | 75.31 ± 0.71 | 45.07 ± 1.91 |
| GraphCL (NeurIPS'20) | 90.18 ± 0.04 | 68.24 ± 0.12 | 90.04 ± 0.05 | 73.38 ± 0.09 | 74.90 ± 0.10 | 45.14 ± 0.14 | 74.51 ± 0.10 | 45.86 ± 0.19 |
| GCA (WWW'21) | 88.95 ± 0.26 | 65.33 ± 0.56 | 89.07 ± 0.26 | 69.82 ± 0.67 | 73.90 ± 0.48 | 41.35 ± 0.79 | 72.91 ± 0.76 | 43.65 ± 0.65 |
| SCAGC | 91.83 ± 0.03 | 71.28 ± 0.06 | 91.84 ± 0.03 | 77.29 ± 0.07 | 79.42 ± 0.02 | 49.05 ± 0.02 | 78.88 ± 0.02 | 54.04 ± 0.03 |

Table 2: The clustering results on ACM and DBLP benchmarks. The best results in all methods and all baselines are represented by bold value and underline value, respectively.

3.1.4 Implementation Details

The proposed SCAGC and the baseline methods are implemented on a Windows 10 machine with an Intel (R) Xeon (R) Gold 6230 CPU and dual NVIDIA Tesla P100-PCIE GPUs. The deep learning environment consists of the PyTorch 1.6.0 platform, the PyTorch Geometric 1.6.1 platform, and TensorFlow 1.13.1. To ensure the availability of the initial pseudo clustering assignment matrix $Q$, we pre-train the shared graph convolutional encoder and the graph contrastive representation learning module via a classic contrastive learning loss.

The hyper-parameters of the proposed method on each dataset are reported in the supplementary material. In this article, we use the adaptive graph augmentation functions proposed by Zhu et al. (2021) to augment the node attributes and topological structure. Notably, the degree centrality is used as the node centrality function to generate different topology graph views. The output size of the shared graph convolutional encoder is set to 256, the output size of the graph contrastive representation learning sub-network is set to 128, and the output size of the contrastive clustering sub-network is set to be equal to the number of clusters K.

For all baseline methods, we follow the hyper-parameter settings as reported in their articles and run their released code to obtain the clustering results. To avoid the randomness of the clustering results, we repeat each experiment of SCAGC and baseline methods for 10 times and report their average values and the corresponding standard deviations.

| Method | Photo ACC | Photo NMI | Photo F1 | Photo ARI | Computers ACC | Computers NMI | Computers F1 | Computers ARI |
|---|---|---|---|---|---|---|---|---|
| K-Means | 36.53 ± 4.11 | 19.31 ± 3.75 | 32.63 ± 1.90 | 12.61 ± 3.54 | 36.44 ± 2.64 | 16.64 ± 4.59 | 28.08 ± 1.44 | 2.71 ± 1.98 |
| SC | 25.58 ± 0.02 | 0.60 ± 0.02 | 5.50 ± 0.00 | 0.03 ± 0.00 | 36.47 ± 0.01 | 0.37 ± 0.02 | 5.81 ± 0.00 | 0.59 ± 0.00 |
| GAE (NeurIPS'16) | 42.03 ± 0.54 | 31.87 ± 0.51 | 34.01 ± 0.42 | 19.31 ± 0.53 | 43.14 ± 1.74 | 35.47 ± 1.58 | 27.06 ± 2.63 | 19.61 ± 1.85 |
| VGAE (NeurIPS'16) | 40.67 ± 0.92 | 31.46 ± 2.03 | 38.01 ± 2.67 | 15.70 ± 1.18 | 42.44 ± 0.16 | 37.62 ± 0.23 | 24.94 ± 0.14 | 22.16 ± 0.35 |
| ARGA (IEEE TC'20) | 57.79 ± 2.26 | 48.01 ± 1.65 | 52.56 ± 2.68 | 34.44 ± 1.58 | 45.67 ± 0.37 | 37.21 ± 0.92 | 40.02 ± 1.29 | 26.28 ± 1.02 |
| ARVGA (IEEE TC'20) | 47.89 ± 1.36 | 41.37 ± 1.39 | 42.96 ± 1.46 | 27.72 ± 1.06 | 47.16 ± 0.26 | 38.84 ± 0.96 | 41.51 ± 0.83 | 27.27 ± 0.84 |
| DAEGC (IJCAI'19) | 60.14 ± 0.93 | 58.03 ± 1.25 | 52.37 ± 2.39 | 43.55 ± 1.76 | 49.26 ± 0.49 | 39.28 ± 4.97 | 33.71 ± 5.76 | 35.29 ± 1.97 |
| SDCN (WWW'20) | 71.43 ± 0.31 | 64.13 ± 0.10 | 68.74 ± 0.22 | 51.17 ± 0.13 | 54.12 ± 1.13 | 39.90 ± 1.51 | 28.84 ± 4.20 | 31.59 ± 1.08 |
| DFCN (AAAI'21) | 73.43 ± 0.61 | 64.74 ± 1.04 | 69.96 ± 0.49 | 52.39 ± 1.01 | 56.24 ± 0.16 | 41.83 ± 0.40 | 33.39 ± 1.11 | 33.02 ± 0.39 |
| GraphCL (NeurIPS'20) | 66.61 ± 0.56 | 57.35 ± 0.32 | 58.52 ± 0.55 | 45.13 ± 0.44 | 50.22 ± 0.66 | 41.78 ± 2.44 | 32.89 ± 2.16 | 36.94 ± 3.20 |
| GCA (WWW'21) | 71.17 ± 0.27 | 60.70 ± 0.41 | 64.12 ± 1.21 | 49.09 ± 0.62 | 54.92 ± 0.55 | 44.36 ± 0.86 | 40.43 ± 0.45 | 35.61 ± 0.62 |
| SCAGC | 75.25 ± 0.10 | 67.18 ± 0.13 | 72.77 ± 0.16 | 56.86 ± 0.23 | 58.43 ± 0.12 | 49.92 ± 0.08 | 43.14 ± 0.09 | 38.29 ± 0.07 |

Table 3: The clustering results on Amazon-Photo and Amazon-Computers benchmarks. The best results in all methods and all baselines are represented by bold value and underline value, respectively.

3.2 Node Clustering Performance

Table 2 and Table 3 present the node clustering results of the proposed SCAGC and all baseline methods. From these results, we have the following observations:

  1. The proposed SCAGC and the other GCN based methods (GAE, VGAE, ARGA, ARVGA, DAEGC, SDCN, DFCN, GraphCL, GCA) significantly and consistently outperform K-Means and SC. The reason may be that GCN based methods simultaneously explore the information embedded in the node attributes and the topological graph structure. In contrast, the classical clustering methods use only the node attributes or the topological structure. Moreover, GCN based methods use a multi-layer nonlinear graph neural network as the feature extractor and map the input data into a new subspace to carry out downstream clustering. These results demonstrate the effectiveness of GCN for processing attributed graph data.

  2. The proposed SCAGC achieves much better clustering results than some representative graph auto-encoders (GAE, VGAE, ARGA, ARVGA). This is because, compared with traditional graph auto-encoders, SCAGC leverages graph augmentation schemes to generate useful attributed graph views and takes the relationship between positive and negative pairs into account. These strategies help to improve the quality of node representation.

  3. In some cases, the clustering performance of the GCL based baselines, i.e., GraphCL and GCA, is inferior to that of the clustering-directed methods, i.e., DAEGC, SDCN, DFCN and the proposed SCAGC. This is because SCAGC integrates node clustering and representation learning into an end-to-end framework, which helps to better explore the cluster structure. In contrast, GraphCL and GCA perform node representation learning and clustering in two separate steps, which limits their performance.

  4. The proposed SCAGC consistently outperforms all the state-of-the-art baselines on all four datasets. In particular, SCAGC surpasses GCA, the most recent GCRL baseline, by 5.95 and 7.70 percentage points in NMI on ACM and DBLP, respectively. This performance verifies the clustering ability of SCAGC, and demonstrates that the contrastive clustering module and the self-supervised graph contrastive representation learning module are effective at benefiting node representation learning and clustering.
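The NMI margins quoted in the last observation can be recomputed from the mean values in Table 2. The short script below does so against both GCRL baselines; the dictionary layout and variable names are only illustrative.

```python
# Mean NMI (%) copied from Table 2.
nmi = {
    "ACM":  {"SCAGC": 71.28, "GCA": 65.33, "GraphCL": 68.24},
    "DBLP": {"SCAGC": 49.05, "GCA": 41.35, "GraphCL": 45.14},
}

# Margin of SCAGC over each GCRL baseline, in percentage points.
gaps = {
    (dataset, baseline): round(scores["SCAGC"] - scores[baseline], 2)
    for dataset, scores in nmi.items()
    for baseline in ("GCA", "GraphCL")
}

for (dataset, baseline), gap in gaps.items():
    print(f"{dataset}: SCAGC - {baseline} = {gap} points")
```

The margins over GCA match the 5.95 and 7.70 points quoted in the text, while the margins over GraphCL, the strongest NMI baseline on both datasets, are 3.04 and 3.91 points.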

Figure 3: Ablation studies on the (a) ACM and (b) DBLP datasets.

Figure 4: The t-SNE visualizations on the ACM (a-d) and DBLP (e-h) datasets as the number of iterations increases.

Figure 5: The convergence of SCAGC on the ACM dataset.

3.3 Ablation Studies

To better illustrate the effectiveness of different components in SCAGC, two ablation scenarios are implemented to further verify the effectiveness of contrastive clustering module, and the proposed self-supervised GCRL loss.

3.3.1 Effect of Contrastive Clustering Module

To better illustrate the effectiveness of the contrastive clustering module, we compare the clustering results of SCAGC and SCAGC without the contrastive clustering module (termed SCAGC w/o CCM) on the ACM and DBLP datasets. Note that, in this scenario, SCAGC w/o CCM is trained using the traditional contrastive loss Chen et al. (2020); Zhu et al. (2021), i.e., SCAGC w/o CCM is clustering-agnostic. As shown in Figure 3 (a-b), the clustering performance of SCAGC (see the red bar) is substantially superior to that of SCAGC w/o CCM (see the yellow bar). This is because SCAGC can better extract node representations by benefiting from the contrastive clustering module. In the absence of the specific clustering task, SCAGC w/o CCM fails to explore the cluster structure, resulting in a sharp drop in performance.

3.3.2 Importance of the Proposed Self-Supervised GCRL Loss

To verify the importance of the proposed self-supervised GCRL loss, we compare the clustering performance of SCAGC and SCAGC without the self-supervised GCRL loss (termed SCAGC w/o SSC) on the ACM and DBLP datasets. Note that, in this scenario, SCAGC w/o SSC is trained by replacing the first term of Eq. (8), i.e., Eq. (3), with a standard contrastive loss Chen et al. (2020); Zhu et al. (2021). As reported in Figure 3 (a-b), SCAGC (see the red bar) always achieves the best performance in terms of all four metrics. These results demonstrate that pseudo label supervision guides the GCRL; thus, leveraging clustering labels is a promising approach for the unsupervised clustering task.

3.4 Model Discussion

3.4.1 Visualizations of Clustering Results

By simultaneously exploiting the good properties of GCRL and taking advantage of the clustering labels, SCAGC ought to learn a discriminative node representation and desirable clustering labels at the same time. To illustrate how SCAGC achieves this goal, as shown in Figure 4, we apply t-SNE van der Maaten and Hinton (2008) to the learned M at four different training iterations on the ACM and DBLP datasets, where different colors indicate different clustering labels predicted by SCAGC. As observed, the cluster assignments become more reasonable, and different clusters separate and gather more distinctly. These results indicate that the learned node representation becomes more compact and discriminative as the number of iterations increases.

3.4.2 Convergence Analysis

Taking the ACM dataset as an example, we investigate the convergence of SCAGC. We record the objective values and clustering results of SCAGC at each iteration and plot them in Figure 5. As shown in Figure 5, the objective values (see the blue line) decrease sharply in the first 100 iterations and then continue to decrease until convergence. Moreover, the ACC of SCAGC increases to a maximum in the first 200 iterations and then remains stable with slight variation. The NMI curve has a similar trend. These observations indicate that SCAGC converges quickly.

4 Conclusion and Future Work

To conclude, we propose a novel Self-supervised Contrastive Attributed Graph Clustering (SCAGC) approach, which can directly predict the clustering labels of an unlabeled attributed graph and handle out-of-sample nodes. We also propose a new self-supervised contrastive loss based on imprecise clustering labels to improve the quality of the node representation. We believe that the proposed SCAGC will help facilitate the exploration of attributed graphs whose labels are time-consuming and labor-intensive to acquire. In the future, we will study how to better exploit the reliable information embedded in imprecise clustering labels and use it to improve the contrastive loss.


  • D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui (2020) Structural deep clustering network. In WWW, pp. 1400–1410. Cited by: §1, item 2, §3.1.3.
  • T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607. Cited by: §1, §2.5, §3.3.1, §3.3.2.
  • J. Cheng, Q. Wang, Z. Tao, D. Xie, and Q. Gao (2020) Multi-view attribute graph convolution networks for clustering. In IJCAI, pp. 2973–2979. Cited by: §1.
  • G. Cui, J. Zhou, C. Yang, and Z. Liu (2020) Adaptive graph encoder for attributed graph embedding. In ACM SIGKDD, pp. 976–985. Cited by: §2.1.
  • S. Fan, X. Wang, C. Shi, E. Lu, K. Lin, and B. Wang (2020) One2Multi graph autoencoder for multi-view graph clustering. In WWW, pp. 3070–3076. Cited by: §1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In IEEE CVPR, pp. 9726–9735. Cited by: §1.
  • C. Huang, H. Xu, Y. Xu, P. Dai, L. Xia, M. Lu, L. Bo, H. Xing, X. Lai, and Y. Ye (2021) Knowledge-aware coupled graph neural network for social recommendation. In AAAI, pp. 4115–4122. Cited by: §1.
  • M. Jin, Y. Zheng, Y. Li, C. Gong, C. Zhou, and S. Pan (2021) Multi-scale contrastive siamese networks for self-supervised graph representation learning. In IJCAI, pp. 1477–1483. Cited by: §1.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: §2.6.
  • T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. In NeurIPS Workshop on Bayesian Deep Learning, Cited by: §1, item 2.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1.
  • Y. Li, P. Hu, J. Z. Liu, D. Peng, J. T. Zhou, and X. Peng (2021) Contrastive clustering. In AAAI, pp. 8547–8555. Cited by: §2.5.
  • Z. Lin and Z. Kang (2021) Graph filter-based multi-view attributed graph clustering. In IJCAI, pp. 2723–2729. Cited by: §1.
  • Y. Mao, X. Yan, Q. Guo, and Y. Ye (2021) Deep mutual information maximin for cross-modal clustering. In AAAI, pp. 8893–8901. Cited by: §2.5.
  • S. Pan, R. Hu, S. Fung, G. Long, J. Jiang, and C. Zhang (2020) Learning graph embedding with adversarial training methods. IEEE Trans. Cybern. 50 (6), pp. 2475–2487. Cited by: §1, item 2.
  • S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang (2018) Adversarially regularized graph autoencoder for graph embedding. In IJCAI, pp. 2609–2615. Cited by: §1.
  • S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang (2016) Tri-party deep network representation. In IJCAI, pp. 1895–1901. Cited by: Table 1.
  • J. Park, M. Lee, H. J. Chang, K. Lee, and J. Y. Choi (2019) Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In IEEE ICCV, pp. 6518–6527. Cited by: §1.
  • J. Piao, G. Zhang, F. Xu, Z. Chen, and Y. Li (2021) Predicting customer value with social relationships via motif-based graph attention networks. In WWW, pp. 3146–3157. Cited by: §1.
  • J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang (2020) GCC: graph contrastive coding for graph neural network pre-training. In ACM SIGKDD, pp. 1150–1160. Cited by: §1.
  • O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018) Pitfalls of graph neural network evaluation. In NeurIPS Workshop on Relational Representation Learning, Cited by: §1, Table 1.
  • F. Sun, J. Hoffmann, V. Verma, and J. Tang (2020) InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In ICLR, Cited by: §1.
  • J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su (2008) ArnetMiner: extraction and mining of academic social networks. In ACM SIGKDD, pp. 990–998. Cited by: Table 1.
  • W. Tu, S. Zhou, X. Liu, X. Guo, Z. Cai, E. Zhu, and J. Cheng (2021) Deep fusion clustering network. In AAAI, pp. 9978–9987. Cited by: §1, item 2, §3.1.3.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (86), pp. 2579–2605. Cited by: §3.4.1.
  • P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2019) Deep graph infomax. In ICLR, Cited by: §1.
  • S. Wan, S. Pan, J. Yang, and C. Gong (2021) Contrastive and generative graph convolutional networks for graph-based semi-supervised learning. In AAAI, pp. 10049–10057. Cited by: §1.
  • C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang (2019) Attributed graph clustering: A deep attentional embedding approach. In IJCAI, pp. 3670–3676. Cited by: §1, item 2.
  • C. Wang, S. Pan, G. Long, X. Zhu, and J. Jiang (2017) MGAE: marginalized graph autoencoder for graph clustering. In CIKM, pp. 889–898. Cited by: §1.
  • W. Xia, Q. Wang, Q. Gao, X. Zhang, and X. Gao (2021) Self-supervised graph convolutional network for multi-view clustering. IEEE Trans. Multim. doi: 10.1109/TMM.2021.3094296. Cited by: §1, §2.1.
  • J. Xie, R. B. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In ICML, Vol. 48, pp. 478–487. Cited by: §1.
  • Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen (2020) Graph contrastive learning with augmentations. In NeurIPS, Cited by: §1, §2.3, item 3.
  • R. Zhang, C. Lu, Z. Jiao, and X. Li (2021) Deep contrastive graph representation via adaptive homotopy learning. CoRR abs/2106.09244. Cited by: §1.
  • X. Zhang, H. Liu, Q. Li, and X. Wu (2019) Attributed graph clustering via adaptive graph convolution. In IJCAI, pp. 4327–4333. Cited by: §1.
  • H. Zhao, X. Yang, Z. Wang, E. Yang, and C. Deng (2021) Graph debiased contrastive learning with joint representation clustering. In IJCAI, pp. 3434–3440. Cited by: §1.
  • Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang (2021) Graph contrastive learning with adaptive augmentation. In WWW, pp. 2069–2080. Cited by: §1, §2.3, item 3, §3.1.4, §3.3.1, §3.3.2.