Introduction
Deep clustering, which aims to train a neural network to learn discriminative feature representations that divide data into several disjoint groups without intensive manual guidance, is becoming an increasingly appealing direction for machine learning researchers. Thanks to the strong representation learning capability of deep learning methods, research in this field has achieved promising performance in many applications, including anomaly detection Markovitz et al. (2020), social network analysis Hu et al. (2017), and face recognition Wang et al. (2019b). Two important factors, i.e., the optimization objective and the fashion of feature extraction, largely determine the performance of a deep clustering method. Specifically, in the unsupervised clustering scenario, where no labels are available for guidance, designing a subtle objective function and an elegant architecture that enable the network to collect more comprehensive and discriminative information for revealing the intrinsic structure is both crucial and challenging.
According to the network optimization objective, existing deep clustering methods can be roughly grouped into five categories, i.e., subspace clustering-based methods Zhou et al. (2019a); Ji et al. (2017); Peng et al. (2017), generative adversarial network-based methods Mukherjee et al. (2019); Ghasedi et al. (2019), spectral clustering-based methods Yang et al. (2019b); Shaham et al. (2018), Gaussian mixture model-based methods Yang et al. (2019a); Chen et al. (2019), and self-optimizing-based methods Xie et al. (2016); Guo et al. (2017). Our method falls into the last category. In the early stage, these deep clustering methods mainly concentrated on exploiting the attribute information in the original feature space of data and achieved good performance in many circumstances. To further improve clustering accuracy, recent literature shows a strong tendency to extract geometrical structure information and integrate it with attribute information for representation learning. Specifically, Yang et al. design a novel stochastic extension of graph embedding to add local data structures into a probabilistic deep Gaussian mixture model (GMM) for clustering Yang et al. (2019a). Distribution preserving subspace clustering (DPSC) first estimates the density distributions of the original data space and the latent embedding space with kernel density estimation. It then preserves the intrinsic cluster structure within data by minimizing the distribution inconsistency between the two spaces Zhou et al. (2019a). More recently, graph convolutional networks (GCNs), which aggregate neighborhood information for better sample representation learning, have attracted the attention of many researchers. Deep attentional embedded graph clustering (DAEGC) exploits both graph structure and node attributes with a graph attention encoder and reconstructs the adjacency matrix with a self-optimizing embedding method Wang et al. (2019a). Following the setting of DAEGC, adversarially regularized graph autoencoder (ARGA) further develops an adversarial regularizer to guide the learning of latent representations Pan et al. (2020). After that, structural deep clustering network (SDCN) Bo et al. (2020) integrates an autoencoder and a graph convolutional network into a unified framework by designing an information delivery operator and a dual self-supervised learning mechanism.
Although these efforts have improved performance by leveraging both kinds of information, we find that 1) existing methods lack a cross-modality dynamic information fusion and processing mechanism: information from the two sources is simply aligned or concatenated, leading to insufficient information interaction and merging; and 2) the generation of the target distribution in existing literature has seldom used information from both sources, making the guidance of network training less comprehensive and accurate. As a consequence, the negotiation between the two information sources is obstructed, resulting in unsatisfying clustering performance.
To tackle the above issues, we propose a deep fusion clustering network (DFCN). The main idea of our solution is to design a dynamic information fusion module that finely processes the attribute and structure information extracted from an autoencoder (AE) and a graph autoencoder (GAE) to construct a more comprehensive and accurate representation. Specifically, a structure and attribute information fusion (SAIF) module is carefully designed to process the information from both sources. First, we integrate the two kinds of sample embeddings from both the local and the global perspective for consensus representation learning. After that, by estimating the similarity between sample points and pre-calculated cluster centers in the latent embedding space with Student's t-distribution, we acquire a more precise target distribution. Finally, we design a triplet self-supervision mechanism that uses the target distribution to provide more dependable guidance for AE, GAE, and the information fusion part simultaneously. Moreover, we develop an improved graph autoencoder (IGAE) with a symmetric structure and reconstruct the adjacency matrix with both the latent representations and the feature representations reconstructed by the graph decoder. The key contributions of this paper are listed as follows:

We propose a deep fusion clustering network (DFCN). In this network, a structure and attribute information fusion (SAIF) module is designed for better information interaction between AE and GAE. With this module, 1) since the decoders of both AE and GAE reconstruct the inputs using a consensus latent representation, the generalization capacity of the latent embeddings is boosted; 2) the reliability of the generated target distribution is enhanced by integrating the complementary information between AE and GAE; and 3) the self-supervised triplet learning mechanism integrates the learning of AE, GAE, and the fusion part in a unified and robust system, and thus further improves the clustering performance.

We develop a symmetric graph autoencoder, i.e., improved graph autoencoder (IGAE), to further improve the generalization capability of the proposed method.

Extensive experimental results on six public benchmark datasets demonstrate that our method is highly competitive and outperforms the state-of-the-art methods in most cases, often by a clear margin.
Related Work
Attributed Graph Clustering
Benefiting from the strong representation power of graph convolutional networks (GCNs) Kipf and Welling (2017), GCN-based clustering methods that jointly learn graph structure and node attributes have been widely studied in recent years Fan et al. (2020); Cheng et al. (2020); Sun et al. (2020). Specifically, graph autoencoder (GAE) and variational graph autoencoder (VGAE) are proposed to integrate graph structure into node attributes by iteratively aggregating neighborhood representations around each central node Kipf and Welling (2016). After that, ARGA Pan et al. (2020), AGAE Tao et al. (2019), DAEGC Wang et al. (2019a), and MinCutPool Bianchi et al. (2020) improve the performance of the early-stage methods with adversarial training, attention, and graph pooling mechanisms, respectively. Although the performance of these methods has improved considerably, the over-smoothing phenomenon of GCNs still limits their accuracy. More recently, SDCN Bo et al. (2020) is proposed to integrate an autoencoder and a GCN module for better representation learning. Through careful theoretical and experimental analysis, its authors find that the autoencoder provides complementary attribute information and helps relieve the over-smoothing phenomenon of the GCN module, while the GCN module provides high-order structure information to the autoencoder. Although SDCN shows that combining an autoencoder and a GCN module can boost the clustering performance of both components, the GCN module in that work acts only as a regularizer of the autoencoder. Thus, the features learned by the GCN module are insufficiently utilized for guiding the self-optimizing network training, and the representation learning of the framework lacks negotiation between the two subnetworks. In contrast, our method introduces an information fusion module (i.e., the SAIF module) to integrate and refine the features learned by the AE and IGAE. As a consequence, the complementary information from the two subnetworks is finely merged to reach a consensus, and more discriminative representations are learned.
Target Distribution Generation
Since reliable guidance is missing in clustering network training, many deep clustering methods seek to generate the target distribution (i.e., "ground-truth" soft labels) for discriminative representation learning in a self-optimizing manner Ren et al. (2019); Xu et al. (2019); Li et al. (2019). The early method in this category, DEC, first trains an encoder; then, with the pre-trained network, it defines a target distribution based on Student's t-distribution and fine-tunes the network with stronger guidance Xie et al. (2016). To increase the accuracy of the target distribution, IDEC jointly optimizes the cluster assignment and learns features that are suitable for clustering with local structure preservation Guo et al. (2017). After that, to better train a network that integrates an autoencoder and a GCN module, SDCN designs a dual self-supervised learning mechanism that conducts target distribution refinement and subnetwork training in a unified system Bo et al. (2020). Despite their success, existing methods generate the target distribution with only the information of either the autoencoder or the GCN module. None of them combines the information from both sides to produce more robust guidance, so the generated target distribution could be less comprehensive. In contrast, in our method, the information fusion module allows the information from the two subnetworks to interact adequately, so the resultant target distribution has the potential to be more reliable and robust than that of its single-source counterparts.
The Proposed Method
Our proposed method mainly consists of four parts, i.e., an autoencoder, an improved graph autoencoder, a fusion module, and the optimization targets (see Fig. 1 for the diagram of our network structure). The encoder parts of both AE and IGAE are similar to those in the existing literature. In the following sections, we first introduce the basic notations and then introduce the decoders of both subnetworks, the fusion module, and the optimization targets in detail.
Notations
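Given an undirected graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ with $K$ cluster centers, $\mathcal{V} = \{v_1, v_2, \dots, v_N\}$ and $\mathcal{E}$ are the node set and the edge set, respectively, where $N$ is the number of samples. The graph is characterized by its attribute matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$ and original adjacency matrix $\mathbf{A} = (a_{ij})_{N \times N}$. Here, $d$ is the attribute dimension, and $a_{ij} = 1$ if $(v_i, v_j) \in \mathcal{E}$, otherwise $a_{ij} = 0$. The corresponding degree matrix is $\mathbf{D} = \mathrm{diag}(d_1, d_2, \dots, d_N)$ with $d_i = \sum_{v_j \in \mathcal{V}} a_{ij}$. With $\mathbf{D}$, the original adjacency matrix is further normalized as $\tilde{\mathbf{A}}$ through calculating $\mathbf{D}^{-\frac{1}{2}} (\mathbf{A} + \mathbf{I}) \mathbf{D}^{-\frac{1}{2}}$, where the identity matrix $\mathbf{I}$ indicates that each node in $\mathcal{V}$ is linked with a self-loop structure. All notations are summarized in Table 1.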
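Table 1: Notation summary.
Notations  Meaning
$\mathbf{X} \in \mathbb{R}^{N \times d}$  Attribute matrix
$\mathbf{A} \in \mathbb{R}^{N \times N}$  Original adjacency matrix
$\mathbf{I} \in \mathbb{R}^{N \times N}$  Identity matrix
$\tilde{\mathbf{A}} \in \mathbb{R}^{N \times N}$  Normalized adjacency matrix
$\mathbf{D} \in \mathbb{R}^{N \times N}$  Degree matrix
$\hat{\mathbf{Z}} \in \mathbb{R}^{N \times d}$  Reconstructed weighted attribute matrix
$\hat{\mathbf{A}} \in \mathbb{R}^{N \times N}$  Reconstructed adjacency matrix
$\mathbf{Z}_{AE} \in \mathbb{R}^{N \times d'}$  Latent embedding of AE
$\mathbf{Z}_{IGAE} \in \mathbb{R}^{N \times d'}$  Latent embedding of IGAE
$\mathbf{Z}_{I} \in \mathbb{R}^{N \times d'}$  Initial fused embedding
$\mathbf{Z}_{L} \in \mathbb{R}^{N \times d'}$  Local structure enhanced $\mathbf{Z}_{I}$
$\mathbf{S} \in \mathbb{R}^{N \times N}$  Normalized self-correlation matrix
$\mathbf{Z}_{G} \in \mathbb{R}^{N \times d'}$  Global structure enhanced $\mathbf{Z}_{L}$
$\tilde{\mathbf{Z}} \in \mathbb{R}^{N \times d'}$  Clustering embedding
$\mathbf{Q} \in \mathbb{R}^{N \times K}$  Soft assignment distribution
$\mathbf{P} \in \mathbb{R}^{N \times K}$  Target distribution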
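To make the normalization step concrete, the following is a minimal, dependency-free sketch of the symmetric normalization $D^{-1/2}(A + I)D^{-1/2}$. The function name `normalize_adjacency` and the toy graph are hypothetical. One assumption is made explicit: degrees are computed after adding self-loops (the renormalization trick of Kipf and Welling), so that isolated nodes do not cause a division by zero. A real implementation would typically use sparse PyTorch tensors instead of nested lists.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Add a self-loop to every node: A + I.
    a_loop = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    # Assumption: degrees are taken from A + I, so every degree is >= 1.
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_loop]
    return [[inv_sqrt[i] * a_loop[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


# Toy path graph 0 - 1 - 2 (illustrative data only).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
A_norm = normalize_adjacency(A)
```

The result stays symmetric, and each entry is scaled down by the degrees of both endpoints, which keeps repeated message passing numerically stable.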
Fusion-based Autoencoders
Input of the Decoder. Most existing autoencoders, whether classic autoencoders or graph autoencoders, reconstruct their inputs with only their own latent representations. In contrast, in our proposed method, we first integrate the compressed representations of AE and GAE into a consensus latent representation. Then, with this embedding as input, the decoders of both AE and GAE reconstruct the inputs of the two subnetworks. Unlike existing methods, our method thus fuses heterogeneous structure and attribute information with a carefully designed fusion module before reconstruction. Detailed information about the fusion module is given in the Structure and Attribute Information Fusion section.
Improved Graph Autoencoder. In the existing literature, the classic autoencoders are usually symmetric, while graph convolutional networks are usually asymmetric Kipf and Welling (2016); Wang et al. (2019a); Tao et al. (2019). They require only the latent representation to reconstruct the adjacency information and overlook that the structure-based attribute information can also be exploited to improve the generalization capability of the corresponding network. To make better use of both the adjacency information and the attribute information, we design a symmetric improved graph autoencoder (IGAE). This network is required to reconstruct both the weighted attribute matrix and the adjacency matrix simultaneously. In the proposed IGAE, a layer in the encoder and decoder is formulated as:
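$$\mathbf{Z}^{(l)} = \sigma\left(\tilde{\mathbf{A}} \mathbf{Z}^{(l-1)} \mathbf{W}^{(l)}\right), \qquad (1)$$
$$\hat{\mathbf{Z}}^{(h)} = \sigma\left(\tilde{\mathbf{A}} \hat{\mathbf{Z}}^{(h-1)} \hat{\mathbf{W}}^{(h)}\right), \qquad (2)$$
where $\mathbf{W}^{(l)}$ and $\hat{\mathbf{W}}^{(h)}$ denote the learnable parameters of the $l$-th encoder layer and the $h$-th decoder layer, $\tilde{\mathbf{A}}$ is the normalized adjacency matrix, and $\sigma$ is a nonlinear activation function, such as ReLU or Tanh. To minimize the reconstruction losses over both the weighted attribute matrix and the adjacency matrix, our IGAE is designed to minimize a hybrid loss function: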
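$$L_{IGAE} = L_w + \gamma L_a. \qquad (3)$$
In Eq.(3), $\gamma$ is a predefined hyperparameter that balances the weight of the two reconstruction loss functions. Specifically, $L_w$ and $L_a$ are defined as follows:
$$L_w = \frac{1}{2N} \left\| \tilde{\mathbf{A}} \mathbf{X} - \hat{\mathbf{Z}} \right\|_F^2, \qquad (4)$$
$$L_a = \frac{1}{2N} \left\| \tilde{\mathbf{A}} - \hat{\mathbf{A}} \right\|_F^2. \qquad (5)$$
In Eq.(4), $\tilde{\mathbf{A}}$ is the normalized adjacency matrix, $\mathbf{X}$ is the attribute matrix, $N$ is the number of samples, and $\hat{\mathbf{Z}}$ is the reconstructed weighted attribute matrix. In Eq.(5), $\hat{\mathbf{A}}$ is the reconstructed adjacency matrix generated by an inner product operation with multi-level representations of the network. By minimizing both Eq.(4) and Eq.(5), the proposed IGAE is trained to minimize the reconstruction loss over the weighted attribute matrix and the adjacency matrix at the same time. Experimental results in the following parts validate the effectiveness of this setting.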
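As an illustration of the hybrid reconstruction objective, the sketch below computes an attribute term and an adjacency term and combines them with a balancing weight. It assumes both terms are squared Frobenius norms scaled by $1/(2N)$, that the attribute target is the normalized adjacency matrix multiplied by the attribute matrix, and that the weight defaults to 0.1 (the value reported in the parameter settings). The helper names `matmul`, `frobenius_sq`, and `igae_loss`, and all toy values, are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def frobenius_sq(m1, m2):
    """Squared Frobenius norm of the difference of two same-shape matrices."""
    return sum((a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


def igae_loss(a_norm, x, z_hat, a_hat, gamma=0.1):
    """Hybrid loss: attribute term plus gamma times adjacency term."""
    n = len(x)
    # Attribute term: compare the weighted attributes (a_norm @ x) with z_hat.
    loss_w = frobenius_sq(matmul(a_norm, x), z_hat) / (2 * n)
    # Adjacency term: compare the normalized adjacency with its reconstruction.
    loss_a = frobenius_sq(a_norm, a_hat) / (2 * n)
    return loss_w + gamma * loss_a, loss_w, loss_a


# Toy inputs (illustrative only): identity graph, two samples, two attributes.
a_norm = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 2.0], [3.0, 4.0]]
z_hat = [[1.0, 2.0], [3.0, 2.0]]
a_hat = [[1.0, 0.0], [0.0, 0.0]]
total, loss_w, loss_a = igae_loss(a_norm, x, z_hat, a_hat)
```

Returning the two terms alongside the total makes it easy to monitor whether the attribute or the adjacency reconstruction dominates during training.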
Structure and Attribute Information Fusion
To sufficiently explore the graph structure and node attribute information extracted by the AE and IGAE, we propose a structure and attribute information fusion (SAIF) module. This module consists of two parts, i.e., a cross-modality dynamic fusion mechanism and a triplet self-supervised strategy. The overall structure of SAIF is illustrated in Fig. 2.
Cross-modality Dynamic Fusion Mechanism.
The information integration within our fusion module includes four steps. First, we combine the latent embeddings of AE ($\mathbf{Z}_{AE} \in \mathbb{R}^{N \times d'}$) and IGAE ($\mathbf{Z}_{IGAE} \in \mathbb{R}^{N \times d'}$) with a linear combination operation:
$$\mathbf{Z}_I = \alpha \mathbf{Z}_{AE} + (1 - \alpha) \mathbf{Z}_{IGAE}, \qquad (6)$$
where $d'$ is the latent embedding dimension, and $\alpha$ is a learnable coefficient which selectively determines the importance of the two information sources according to the property of the corresponding dataset. In our paper, $\alpha$ is initialized as 0.5 and then tuned automatically with a gradient descent method.
Then, we process the combined information with a graph convolution-like operation (i.e., a message passing operation). With this operation, we enhance the initial fused embedding $\mathbf{Z}_I$ by considering the local structure within data:
$$\mathbf{Z}_L = \tilde{\mathbf{A}} \mathbf{Z}_I. \qquad (7)$$
In Eq.(7), $\tilde{\mathbf{A}}$ is the normalized adjacency matrix and $\mathbf{Z}_L$ denotes the local structure enhanced $\mathbf{Z}_I$.
After that, we further introduce a self-correlated learning mechanism to exploit the non-local relationship among samples in the preliminary information fusion space. Specifically, we first calculate the normalized self-correlation matrix $\mathbf{S} \in \mathbb{R}^{N \times N}$ through Eq.(8):
$$\mathbf{S}_{ij} = \frac{e^{(\mathbf{Z}_L \mathbf{Z}_L^{\mathrm{T}})_{ij}}}{\sum_{k=1}^{N} e^{(\mathbf{Z}_L \mathbf{Z}_L^{\mathrm{T}})_{ik}}}. \qquad (8)$$
With $\mathbf{S}$ as coefficients, we recombine $\mathbf{Z}_L$ by considering the global correlation among samples: $\mathbf{Z}_G = \mathbf{S} \mathbf{Z}_L$.
Finally, we adopt a skip connection to encourage information to pass smoothly within the fusion mechanism:
$$\tilde{\mathbf{Z}} = \beta \mathbf{Z}_G + \mathbf{Z}_L, \qquad (9)$$
where $\beta$ is a scale parameter. Following the setting in Fu et al. (2019), we initialize it as 0 and learn its weight while training the network. Technically, our cross-modality dynamic fusion mechanism considers the sample correlation from both the local and the global perspective. Thus, it has potential benefits for finely fusing and refining the information from both AE and IGAE to learn consensus latent representations.
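The four fusion steps (linear combination, local message passing, global self-correlation, and skip connection) can be sketched as follows. This is an illustrative reading rather than the authors' code: the mixing coefficient and the scale parameter are treated as fixed scalars here, whereas the text describes them as learnable; the self-correlation matrix is assumed to be a row-wise softmax of the Gram matrix of the locally enhanced embedding; and the names `fuse`, `softmax_rows`, and `matmul` are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def fuse(z_ae, z_igae, a_norm, alpha=0.5, beta=0.0):
    """Return the fused clustering embedding and the self-correlation matrix."""
    # Step 1: linear combination of the two latent embeddings.
    z_i = [[alpha * a + (1.0 - alpha) * g for a, g in zip(ra, rg)]
           for ra, rg in zip(z_ae, z_igae)]
    # Step 2: local enhancement by one message passing step.
    z_l = matmul(a_norm, z_i)
    # Step 3: global enhancement with a normalized self-correlation matrix.
    z_l_t = [list(col) for col in zip(*z_l)]
    s = softmax_rows(matmul(z_l, z_l_t))
    z_g = matmul(s, z_l)
    # Step 4: skip connection scaled by beta.
    z_tilde = [[beta * g + l for g, l in zip(rg, rl)]
               for rg, rl in zip(z_g, z_l)]
    return z_tilde, s


# Toy inputs (illustrative only).
z_ae = [[1.0, 0.0], [0.0, 1.0]]
z_igae = [[0.0, 1.0], [1.0, 0.0]]
a_norm = [[1.0, 0.0], [0.0, 1.0]]
z_beta0, s = fuse(z_ae, z_igae, a_norm, alpha=0.5, beta=0.0)
z_beta1, _ = fuse(z_ae, z_igae, a_norm, alpha=0.5, beta=1.0)
```

With the scale parameter at its initial value of 0, the output reduces to the locally enhanced embedding, so the global branch only contributes once training moves the parameter away from zero.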
Triplet Self-supervised Strategy.
To generate more reliable guidance for clustering network training, we first adopt the more robust clustering embedding $\tilde{\mathbf{Z}}$, which has integrated the information from both AE and IGAE, for target distribution generation. As shown in Eq.(10) and Eq.(11), the generation process includes two steps:
$$q_{ij} = \frac{\left(1 + \|\tilde{z}_i - u_j\|^2 / v\right)^{-\frac{v+1}{2}}}{\sum_{j'} \left(1 + \|\tilde{z}_i - u_{j'}\|^2 / v\right)^{-\frac{v+1}{2}}}, \qquad (10)$$
$$p_{ij} = \frac{q_{ij}^2 / \sum_{i} q_{ij}}{\sum_{j'} \left(q_{ij'}^2 / \sum_{i} q_{ij'}\right)}. \qquad (11)$$
In the first step (corresponding to Eq.(10)), we calculate the similarity between the $i$-th sample ($\tilde{z}_i$) and the $j$-th pre-calculated clustering center ($u_j$) in the fused embedding space using Student's t-distribution as the kernel. In Eq.(10), $v$ is the degree of freedom of Student's t-distribution and $q_{ij}$ indicates the probability of assigning the $i$-th node to the $j$-th center (i.e., a soft assignment). The soft assignment matrix $\mathbf{Q} \in \mathbb{R}^{N \times K}$ reflects the distribution of all samples. In the second step, to increase the confidence of cluster assignment, we introduce Eq.(11) to drive all samples closer to the cluster centers. Specifically, $0 \le p_{ij} \le 1$ is an element of the generated target distribution $\mathbf{P} \in \mathbb{R}^{N \times K}$, which indicates the probability that the $i$-th sample belongs to the $j$-th cluster center. With the iteratively generated target distribution, we then calculate the soft assignment distributions of AE and IGAE by applying Eq.(10) to the latent embeddings of the two subnetworks, respectively. We denote the soft assignment distributions of IGAE and AE as $\mathbf{Q}'$ and $\mathbf{Q}''$.
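The two-step generation of the target distribution can be sketched as below, following the formulation popularized by DEC (Xie et al. (2016)): a Student's t kernel turns distances to cluster centers into soft assignments, and squaring then renormalizing by soft cluster frequency sharpens them. The function names `soft_assign` and `target_distribution`, the default degree of freedom of 1, and the toy points are assumptions for illustration.

```python
def soft_assign(z, centers, v=1.0):
    """Soft assignment of each sample to each center with a Student's t kernel."""
    q = []
    for zi in z:
        kernel = [
            (1.0 + sum((a - b) ** 2 for a, b in zip(zi, u)) / v)
            ** (-(v + 1.0) / 2.0)
            for u in centers
        ]
        total = sum(kernel)
        q.append([k / total for k in kernel])
    return q


def target_distribution(q):
    """Sharpen soft assignments: square, divide by cluster frequency, renormalize."""
    freq = [sum(col) for col in zip(*q)]  # soft frequency of each cluster
    p = []
    for row in q:
        weights = [qij ** 2 / fj for qij, fj in zip(row, freq)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


# Toy data (illustrative only): two samples that coincide with two centers.
z = [[0.0, 0.0], [1.0, 0.0]]
centers = [[0.0, 0.0], [1.0, 0.0]]
q = soft_assign(z, centers)
p = target_distribution(q)
```

In this toy case the soft assignment of the first sample is (2/3, 1/3) and the target becomes (0.8, 0.2), which shows how the second step pushes confident assignments further toward their cluster.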
To train the network in a unified framework and improve the representative capability of each component, we design a triplet clustering loss by adapting the KL-divergence in the following form:
$$L_{KL} = \sum_{i=1}^{N} \sum_{j=1}^{K} p_{ij} \log \frac{p_{ij}}{\left(q_{ij} + q'_{ij} + q''_{ij}\right) / 3}. \qquad (12)$$
In this formulation, the soft assignment distributions of the fused representations ($q_{ij}$), IGAE ($q'_{ij}$), and AE ($q''_{ij}$) are summed and aligned with the robust target distribution ($p_{ij}$) simultaneously. Since the target distribution is generated without human guidance, we name the loss function the triplet clustering loss and the corresponding training mechanism the triplet self-supervised strategy.
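A minimal sketch of such a triplet clustering loss is given below. It assumes the three soft assignment distributions (fused, IGAE, AE) are averaged element-wise before entering a KL-divergence against the target distribution; the function name `triplet_kl` and the toy values are hypothetical. Entries with zero target probability are skipped, and the averaged assignment is assumed to be strictly positive, which holds for a Student's t kernel.

```python
import math


def triplet_kl(p, q_fused, q_igae, q_ae):
    """KL(P || mean of the three soft assignment distributions)."""
    loss = 0.0
    for p_row, r1, r2, r3 in zip(p, q_fused, q_igae, q_ae):
        for pij, a, b, c in zip(p_row, r1, r2, r3):
            mean_q = (a + b + c) / 3.0
            if pij > 0.0:  # terms with zero target probability contribute 0
                loss += pij * math.log(pij / mean_q)
    return loss


# Toy distributions (illustrative only): one sample, two clusters.
p = [[0.8, 0.2]]
q_uniform = [[0.5, 0.5]]
loss_mismatch = triplet_kl(p, q_uniform, q_uniform, q_uniform)
loss_match = triplet_kl(p, p, p, p)
```

Because a single target distribution supervises all three branches at once, the gradient of this one term reaches the AE, the IGAE, and the fusion part together.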
Joint loss and Optimization
The overall learning objective consists of two main parts, i.e., the reconstruction loss of AE and IGAE, and the clustering loss which is correlated with the target distribution:
$$L = L_{AE} + L_{IGAE} + \lambda L_{KL}. \qquad (13)$$
In Eq.(13), $L_{AE}$ is the mean square error (MSE) reconstruction loss of AE, $L_{IGAE}$ is the reconstruction loss of IGAE, and $L_{KL}$ is the triplet clustering loss. Different from SDCN, the proposed DFCN reconstructs the inputs of both subnetworks with the consensus latent representation. $\lambda$ is a predefined hyperparameter which balances the importance of reconstruction and clustering. The detailed learning procedure of the proposed DFCN is shown in Algorithm 1.
Experiments
Benchmark Datasets
We evaluate the proposed DFCN on six popular public datasets: three graph datasets, namely ACM (http://dl.acm.org/), DBLP (https://dblp.uni-trier.de), and CITE (http://citeseerx.ist.psu.edu/index), and three non-graph datasets, namely USPS LeCun et al. (1990), HHAR Stisen et al. (2015), and REUT Lewis et al. (2004). Table 2 summarizes the brief information of these datasets. For the datasets whose affinity matrix is absent (USPS, HHAR, and REUT), we follow Bo et al. (2020) and construct the matrix with the heat kernel method.
Experiment Setup
Training Procedure
Our method is implemented on the PyTorch platform with an NVIDIA 2080Ti GPU. The training of the proposed DFCN includes three steps. First, we pre-train the AE and IGAE independently for 30 iterations by minimizing the reconstruction loss functions. Then, both subnetworks are integrated into a unified framework and trained for another 100 iterations. Finally, with the learned centers of different clusters and under the guidance of the triplet self-supervised strategy, we train the whole network for at least 200 iterations until convergence. The cluster ID is acquired by performing the K-means algorithm over the consensus clustering embedding $\tilde{\mathbf{Z}}$. Following all the compared methods, to alleviate the adverse influence of randomness, we repeat each experiment 10 times and report the average values and the corresponding standard deviations.
Table 2: Brief information of the benchmark datasets.
Dataset  Type  Samples  Classes  Dimension
USPS  Image  9298  10  256 
HHAR  Record  10299  6  561 
REUT  Text  10000  4  2000 
ACM  Graph  3025  3  1870 
DBLP  Graph  4058  4  334 
CITE  Graph  3327  6  3703 
Table 3: Clustering performance of our method and ten baseline methods on six benchmark datasets.
Data  Metric  K-means  AE  DEC  IDEC  GAE  VGAE  ARGA  DAEGC  SDCN_Q  SDCN  DFCN
USPS  ACC  66.8  71.0  73.3  76.2  63.1  56.2  66.8  73.6  77.1  78.1  79.5
USPS  NMI  62.6  67.5  70.6  75.6  60.7  51.1  61.6  71.1  77.7  79.5  82.8
USPS  ARI  54.6  58.8  63.7  67.9  50.3  41.0  51.1  63.3  70.2  71.8  75.3
USPS  F1  64.8  69.7  71.8  74.6  61.8  53.6  66.1  72.5  75.9  77.0  78.3
HHAR  ACC  60.0  68.7  69.4  71.1  62.3  71.3  63.3  76.5  83.5  84.3  87.1
HHAR  NMI  58.9  71.4  72.9  74.2  55.1  63.0  57.1  69.1  78.8  79.9  82.2
HHAR  ARI  46.1  60.4  61.3  62.8  42.6  51.5  44.7  60.4  71.8  72.8  76.4
HHAR  F1  58.3  66.4  67.3  68.6  62.6  71.6  61.1  76.9  81.5  82.6  87.3
REUT  ACC  54.0  74.9  73.6  75.4  54.4  60.9  56.2  65.6  79.3  77.2  77.7
REUT  NMI  41.5  49.7  47.5  50.3  25.9  25.5  28.7  30.6  56.9  50.8  59.9
REUT  ARI  28.0  49.6  48.4  51.3  19.6  26.2  24.5  31.1  59.6  55.4  59.8
REUT  F1  41.3  61.0  64.3  63.2  43.5  57.1  51.1  61.8  66.2  65.5  69.6
ACM  ACC  67.3  81.8  84.3  85.1  84.5  84.1  86.1  86.9  87.0  90.5  90.9
ACM  NMI  32.4  49.3  54.5  56.6  55.4  53.2  55.7  56.2  58.9  68.3  69.4
ACM  ARI  30.6  54.6  60.6  62.2  59.5  57.7  62.9  59.4  65.3  73.9  74.9
ACM  F1  67.6  82.0  84.5  85.1  84.7  84.2  86.1  87.1  86.8  90.4  90.8
DBLP  ACC  38.7  51.4  58.2  60.3  61.2  58.6  61.6  62.1  65.7  68.1  76.0
DBLP  NMI  11.5  25.4  29.5  31.2  30.8  26.9  26.8  32.5  35.1  39.5  43.7
DBLP  ARI  7.0  12.2  23.9  25.4  22.0  17.9  22.7  21.0  34.0  39.2  47.0
DBLP  F1  31.9  52.5  59.4  61.3  61.4  58.7  61.8  61.8  65.8  67.7  75.7
CITE  ACC  39.3  57.1  55.9  60.5  61.4  61.0  56.9  64.5  61.7  66.0  69.5
CITE  NMI  16.9  27.6  28.3  27.2  34.6  32.7  34.5  36.4  34.4  38.7  43.9
CITE  ARI  13.4  29.3  28.1  25.7  33.6  33.1  33.4  37.8  35.5  40.2  45.5
CITE  F1  36.1  53.8  52.6  61.6  57.4  57.7  54.8  62.2  57.8  63.6  64.3
Parameters Setting
For ARGA Pan et al. (2020), we set the parameters by following the setting of the original paper. For the other compared methods, we directly report the results listed in the SDCN paper Bo et al. (2020). For our method, we adopt the original code and data of SDCN for data pre-processing and testing. All ablation studies are trained with the Adam optimizer. The optimization stops when the validation loss comes to a plateau. The learning rate is set to 1e-3 for USPS and HHAR, 1e-4 for REUT, DBLP, and CITE, and 5e-5 for ACM. The training batch size is set to 256, and we adopt an early stop strategy to avoid overfitting. According to the results of parameter sensitivity testing, we fix the two balancing hyperparameters $\gamma$ (Eq.(3)) and $\lambda$ (Eq.(13)) to 0.1 and 10, respectively. Moreover, we set the number of nearest neighbors of each node to 5 for all non-graph datasets.
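For the non-graph datasets, a nearest-neighbor graph built from a heat kernel similarity could look like the sketch below. The similarity $\exp(-\|x_i - x_j\|^2 / t)$, the bandwidth `t`, the symmetrization of the resulting edges, and the function name `knn_heat_kernel_graph` are all assumptions for illustration; the exact construction used in the experiments follows SDCN and is not reproduced here.

```python
import math


def knn_heat_kernel_graph(x, k=5, t=1.0):
    """Connect each sample to its k most similar samples under a heat kernel."""
    n = len(x)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        sims = []
        for j in range(n):
            if j == i:
                continue
            dist_sq = sum((a - b) ** 2 for a, b in zip(x[i], x[j]))
            sims.append((math.exp(-dist_sq / t), j))
        # Highest similarity first; keep the top-k neighbors.
        sims.sort(reverse=True)
        for _, j in sims[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # assumption: symmetrize so the graph is undirected
    return adj


# Toy 1-D data (illustrative only): two well-separated pairs of points.
points = [[0.0], [1.0], [10.0], [11.0]]
graph = knn_heat_kernel_graph(points, k=1)
```

Because of the symmetrization, a node can end up with more than `k` neighbors when it is a popular nearest neighbor of other nodes.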
Evaluation Metric
The clustering performance of all methods is evaluated by four metrics: Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and macro F1-score (F1) Zhou et al. (2020, 2019b); Liu et al. (2020a, b, 2019). The best mapping between cluster IDs and class IDs is found by using the Kuhn-Munkres algorithm Lovász and Plummer (1986).
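Clustering accuracy depends on first matching cluster IDs to class IDs. The sketch below illustrates this with an exhaustive search over all one-to-one mappings, which is a deliberately simple stand-in for the Kuhn-Munkres algorithm and is only practical for a small number of clusters. The function name `clustering_accuracy` and the toy labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(y_true, y_pred, n_clusters):
    """Accuracy under the best one-to-one mapping of cluster IDs to class IDs."""
    # counts[c][t] = number of samples in cluster c whose true class is t.
    counts = [[0] * n_clusters for _ in range(n_clusters)]
    for t, c in zip(y_true, y_pred):
        counts[c][t] += 1
    # Brute-force search over mappings (stand-in for Kuhn-Munkres).
    best = max(
        sum(counts[c][perm[c]] for c in range(n_clusters))
        for perm in permutations(range(n_clusters))
    )
    return best / len(y_true)


# Toy labels (illustrative only): clusters 0 and 1 are swapped, one error.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 1]
acc = clustering_accuracy(y_true, y_pred, n_clusters=3)
```

For the six benchmark datasets the number of classes is at most 10, where the factorial search is already slow, so a polynomial-time assignment solver is the practical choice.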
Comparison with the State-of-the-art Methods
In this part, we compare our proposed method with ten state-of-the-art clustering methods to illustrate its effectiveness. Among them, K-means Hartigan and Wong (1979) is the representative of classic shallow clustering methods. AE Hinton and Salakhutdinov (2006), DEC Xie et al. (2016), and IDEC Guo et al. (2017) represent the autoencoder-based clustering methods, which learn the representations for clustering by training an autoencoder. GAE/VGAE Kipf and Welling (2016), ARGA Pan et al. (2020), and DAEGC Wang et al. (2019a) are typical graph convolutional network-based methods, in which the clustering representation is embedded with structure information by GCN. SDCN_Q and SDCN Bo et al. (2020) are representatives of hybrid methods, which take advantage of both AE and GCN modules for clustering.
The clustering performance of our method and the 10 baseline methods on six benchmark datasets is summarized in Table 3. Based on the results, we have the following observations:
1) DFCN shows superior performance against the compared methods in most circumstances. Specifically, K-means performs clustering on raw data. AE, DEC, and IDEC merely exploit node attribute representations for clustering. These methods seldom take structure information into account, leading to suboptimal performance. In contrast, DFCN successfully leverages available data by selectively integrating the information of graph structure and node attributes, which complement each other for consensus representation learning and greatly improve clustering performance.
2) GCN-based methods such as GAE, VGAE, ARGA, and DAEGC fall short of ours, because these methods under-utilize the abundant information in the data itself and might be limited by the over-smoothing phenomenon. In contrast, DFCN incorporates attribute-based representations learned by AE into the whole clustering framework and mutually explores graph structure and node attributes with a fusion module for consensus representation learning. As a result, the proposed DFCN improves on the clustering performance of the existing GCN-based methods by a clear margin.
3) DFCN achieves better clustering results than the strongest baseline methods SDCN_Q and SDCN in the majority of cases, especially on the HHAR, DBLP, and CITE datasets. On the DBLP dataset, for instance, our method improves ACC, NMI, ARI, and F1 by 7.9, 4.2, 7.8, and 8.0 percentage points over SDCN, respectively. This is because DFCN not only achieves a dynamic interaction between graph structure and node attributes to reveal the intrinsic clustering structure, but also adopts a triplet self-supervised strategy to provide precise network training guidance.
Table 4: Ablation comparison of DFCN with its single-source variants +AE and +IGAE.
Dataset  Model  ACC  NMI  ARI  F1
USPS  +AE  78.3  81.3  73.6  76.8
USPS  +IGAE  76.9  77.1  68.8  74.8
USPS  DFCN  79.5  82.8  75.3  78.3
HHAR  +AE  75.2  82.8  71.7  72.6
HHAR  +IGAE  82.8  79.6  72.3  83.4
HHAR  DFCN  87.1  82.2  76.4  87.3
REUT  +AE  69.3  48.5  44.6  58.3
REUT  +IGAE  71.4  52.5  49.1  61.5
REUT  DFCN  77.7  59.9  59.8  69.6
ACM  +AE  90.2  67.5  73.2  90.2
ACM  +IGAE  89.6  65.6  71.8  89.6
ACM  DFCN  90.9  69.4  74.9  90.8
DBLP  +AE  64.2  30.2  29.4  64.6
DBLP  +IGAE  67.5  34.2  31.5  67.6
DBLP  DFCN  76.0  43.7  47.0  75.7
CITE  +AE  69.3  42.9  44.7  64.4
CITE  +IGAE  67.9  41.8  43.0  63.7
CITE  DFCN  69.5  43.9  45.5  64.3
Ablation Studies
Effectiveness of IGAE
We further conduct ablation studies to verify the effectiveness of IGAE and report the results in Fig. 3. GAE-$L_w$ or GAE-$L_a$ denotes the method optimized by the reconstruction loss function of the weighted attribute matrix ($L_w$) or the adjacency matrix ($L_a$) only. We find that GAE-$L_w$ consistently performs better than GAE-$L_a$ on six datasets. Besides, IGAE clearly improves the clustering performance over the method which reconstructs the adjacency matrix only. Both observations illustrate that our proposed reconstruction measure is able to exploit more comprehensive information for improving the generalization capability of the deep clustering network. By this means, the latent embedding inherits more properties from the attribute space of the original graph, preserving representative features that generate better clustering decisions.
Analysis of the SAIF Module
In this part, we conduct several experiments to verify the effectiveness of the SAIF module. As summarized in Fig. 4, we observe that 1) compared with the baseline, the Baseline-C method achieves performance improvements of about 0.5% to 5.0%, indicating that exploring graph structure and node attributes from both the local and the global perspective helps to learn consensus latent representations for better clustering; 2) the performance of the Baseline-C-T method is consistently better than that of the Baseline-C-S method on all datasets. The reason is that our triplet self-supervised strategy successfully generates more reliable guidance for the training of AE, IGAE, and the fusion part, making them benefit from each other. These observations demonstrate the superiority of the SAIF module over the baseline.
Influence of Exploiting Both-source Information
We compare our method with two variants to validate the effectiveness of complementary two-modality (structure and attribute) information learning for target distribution generation. As reported in Table 4, +AE or +IGAE refers to the DFCN with only the AE or IGAE part, respectively. On one hand, +AE and +IGAE each achieve better performance on different datasets, which indicates that neither the AE nor the IGAE information consistently outperforms the other, so combining the both-source information can potentially improve the robustness of the hybrid method. On the other hand, DFCN encodes both DNN-based and GCN-based representations and outperforms the single-source methods in nearly all cases. This shows that 1) both-source information is equally essential for the performance improvement of DFCN; 2) DFCN can exploit the complementary two-modality information to make the target distribution more reliable and robust for better clustering.
Analysis of Hyperparameter $\lambda$
As can be seen in Eq.(13), DFCN introduces a hyperparameter $\lambda$ to make a trade-off between reconstruction and clustering. We conduct experiments to show the effect of this parameter on all datasets. Fig. 5 illustrates the performance variation of DFCN when $\lambda$ varies from 0.01 to 100. From these figures, we observe that 1) the hyperparameter $\lambda$ is effective in improving the clustering performance; 2) the performance of the method is stable over a wide range of $\lambda$; 3) DFCN tends to perform well when $\lambda$ is set to 10 across all datasets.
Visualization of Clustering Results
Conclusion
In this paper, we propose a novel neural network-based clustering method termed Deep Fusion Clustering Network (DFCN). In our method, the core component, the SAIF module, leverages both graph structure and node attributes via a dynamic cross-modality fusion mechanism and a triplet self-supervised strategy. In this way, more consensus and discriminative information from both sides is encoded to construct a robust target distribution, which effectively provides precise network training guidance. Moreover, the proposed IGAE helps to improve the generalization capability of the proposed method. Experiments on six benchmark datasets show that DFCN outperforms state-of-the-art baseline methods in most cases. In the future, we plan to further improve our method and adapt it to multi-view graph clustering and incomplete multi-view graph clustering applications.
Acknowledgments
This work is supported by the National Key R&D Program of China (Grant 2018YFB1800202, 2020AAA0107100, 2020YFC2003400), the National Natural Science Foundation of China (Grant 61762033, 62006237, 62072465), the Hainan Province Key R&D Plan Project (Grant ZDYF2020040), the Hainan Provincial Natural Science Foundation of China (Grant 2019RC041, 2019RC098), and the Opening Project of Shanghai Trusted Industrial Control Platform (Grant TICPSH202003005ZC).
References
Bianchi et al. (2020). Spectral clustering with graph neural networks for graph pooling. In ICML, pp. 2729–2738.
Bo et al. (2020). Structural deep clustering network. In WWW, pp. 1400–1410.
Chen et al. (2019). Unsupervised clustering of quantitative imaging phenotypes using autoencoder and Gaussian mixture model. In MICCAI, pp. 575–582.
Cheng et al. (2020). Multi-view attribute graph convolution networks for clustering. In IJCAI, pp. 2973–2979.
Fan et al. (2020). One2Multi graph autoencoder for multi-view graph clustering. In WWW, pp. 3070–3076.
Fu et al. (2019). Dual attention network for scene segmentation. In CVPR, pp. 3146–3154.
Ghasedi et al. (2019). Balanced self-paced learning for generative adversarial clustering network. In CVPR, pp. 4391–4400.
Guo et al. (2017). Improved deep embedded clustering with local structure preservation. In IJCAI, pp. 1753–1759.
Hartigan and Wong (1979). A k-means clustering algorithm. Applied Stats 28 (1), pp. 100–108.
Hinton and Salakhutdinov (2006). Reducing the dimensionality of data with neural networks. Science 313, pp. 504–507.
Hu et al. (2017). Deep graph clustering in social network. In WWW, pp. 1425–1426.
Ji et al. (2017). Deep subspace clustering networks. In NIPS, pp. 24–33.
Kipf and Welling (2017). Semi-supervised classification with graph convolutional networks. In ICLR, pp. 14.
Kipf and Welling (2016). Variational graph auto-encoders. ArXiv abs/1611.07308.
LeCun et al. (1990). Handwritten zip code recognition with multilayer networks. In ICPR, pp. 36–40.
Lewis et al. (2004). RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5 (2), pp. 361–397.
Li et al. (2019). Deep adversarial multi-view clustering network. In IJCAI, pp. 2952–2958.
Liu et al. (2020). Absent multiple kernel learning algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (6), pp. 1303–1316.
 Late fusion incomplete multiview clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (10), pp. 2410–2423. Cited by: Evaluation Metric.
 Multiple kernel kmeans with incomplete kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (5), pp. 1191–1204. Cited by: Evaluation Metric.
 Matching theory. Cited by: Evaluation Metric.
 Visualizing data using tsne. Journal of Machine Learning Research 9 (2605), pp. 2579–2605. Cited by: Visualization of Clustering Results.
 Graph embedded pose clustering for anomaly detection. In CVPR, pp. 10536–10544. Cited by: Introduction.
 ClusterGAN: latent space clustering in generative adversarial networks. In AAAI, pp. 1965–1972. Cited by: Introduction.
 Learning graph embedding with adversarial training methods. IEEE Transactions on Cybernetics 50 (6), pp. 2475–2487. Cited by: Introduction, Attributed Graph Clustering, Parameters Setting, Comparison with the Stateoftheart Methods.
 Cascade subspace clusterings. In AAAI, pp. 2478–2484. Cited by: Introduction.
 Semisupervised deep embedded clustering. Neurocomputing 325 (1), pp. 121–130. Cited by: Target Distribution Generation.
 SpectralNet: spectral clustering using deep neural networks. In ICLR, pp. . Cited by: Introduction.
 Smart devices are different: assessing and mitigating mobile sensing heterogeneities for activity recognition. In SENSYS, pp. 127–140. Cited by: Benchmark Datasets.
 Multistage selfsupervised learning for graph convolutional networks on graphs with few labeled nodes. In AAAI, pp. 5892–5899. Cited by: Attributed Graph Clustering.
 Adversarial graph embedding for ensemble clustering. In IJCAI, pp. 3562–3568. Cited by: Attributed Graph Clustering, Fusionbased Autoencoders.
 Attributed graph clustering: a deep attentional embedding approach. In IJCAI, pp. 3670–3676. Cited by: Introduction, Attributed Graph Clustering, Fusionbased Autoencoders, Comparison with the Stateoftheart Methods.
 Linkage based face clustering via graph convolution network. In CVPR, pp. 1117–1125. Cited by: Introduction.

Unsupervised deep embedding for clustering analysis
. In ICML, pp. 478–487. Cited by: Introduction, Target Distribution Generation, Comparison with the Stateoftheart Methods.  Adversarial incomplete multiview clustering. In IJCAI, pp. 3933–3939. Cited by: Target Distribution Generation.
 Deep clustering by gaussian mixture variational autoencoders with graph embedding. In ICCV, pp. 6440–6449. Cited by: Introduction.
 Deep spectral clustering using dual autoencoder network. In CVPR, pp. 4066–4075. Cited by: Introduction.
 Latent distribution preserving deep subspace clustering. In IJCAI, pp. 4440–4446. Cited by: Introduction.
 Multiple kernel clustering with neighborkernel subspace segmentation. IEEE transactions on neural networks and learning systems 31 (4), pp. 1351–1362. Cited by: Evaluation Metric.
 Subspace segmentationbased robust multiple kernel clustering. Information Fusion 53, pp. 145–154. Cited by: Evaluation Metric.