1 Introduction
Clustering, which aims to group data points without any prior information, is one of the most fundamental tasks in machine learning. Besides the well-known k-means, graph-based clustering (Ng et al., 2002; Nie et al., 2014; Zhang et al., 2018) is another representative family of clustering methods. A typical graph-based clustering method consists of two steps: 1) construct a graph by some algorithm; 2) divide the samples into clusters according to the constructed graph. For example, classical spectral clustering first builds a weighted adjacency matrix via nearest neighbors and a Gaussian kernel, and then seeks a good graph cut that splits the graph into multiple connected components. Accordingly, the performance of these methods depends heavily on the quality of the constructed graph. Graph-based clustering methods are widely used because they capture manifold information and are therefore applicable to non-Euclidean data, which k-means cannot handle. Owing to the success of deep learning, combining neural networks with traditional clustering models has been studied extensively (Shaham et al., 2018; Xie et al., 2016; Zhang et al., 2019). In particular, the autoencoder (Hinton & Salakhutdinov, 2006) is the base framework of most deep clustering approaches.
Network embedding (also known as graph embedding) is an attractive task for graph-type data such as recommendation systems and social networks. The goal is to map the nodes of a given graph into latent features (namely, an embedding) so that the learned embedding can be used for node classification, node clustering, and link prediction. Roughly speaking, network embedding approaches fall into two categories: generative models (Wang et al., 2018; Perozzi et al., 2014; Grover & Leskovec, 2016) and discriminative models (Cao et al., 2016; Wang et al., 2016). The former model a connectivity distribution for each node, while the latter learn directly to distinguish whether an edge exists between two nodes. Nevertheless, no existing method integrates generative models with clustering.
In recent years, graph neural networks (GNNs) (Scarselli et al., 2008), especially graph convolution networks (GCNs), have attracted a great deal of attention owing to the success of neural networks. GNNs extend classical neural networks to irregular data so that the deep information hidden in a graph can be exploited. In this paper, we focus only on GCNs and their variants. Unlike in traditional CNNs, a key issue in GCNs is how to define the convolution operator on irregular data. An intuitive approach is to perform convolution over a constrained neighborhood of each node, which is commonly known as the spatial method (Niepert et al., 2016). In contrast, the spectral operator first defines the frequency domain of graph data and then performs convolution with the help of the convolution theorem. GCNs have shown superiority over traditional graph embedding models. Similarly, the graph autoencoder (Kipf & Welling, 2016) was developed to extend GCN to unsupervised learning.
However, the existing methods are limited to graph-type data (e.g., social networks and citation networks). Since a large proportion of clustering methods, such as spectral clustering, are graph-based, it is reasonable to consider how to employ GCN to improve the performance of graph-based clustering methods.
In this paper, we propose the Adaptive Graph AutoEncoder, a novel clustering model for general data clustering, to extend the graph autoencoder to common scenarios. The main contributions are listed as follows:

To build a desirable graph, our model incorporates a generative model of network embedding. Moreover, the learned connectivity distribution is also used as the target that the graph autoencoder aims to reconstruct.

Our model updates the graph adaptively according to the generated embedding so that it can exploit the deep information and revise the poor graph caused by raw features.

Our model also employs a manifold regularization to preserve the local structure, which can be regarded as pseudo-supervised information.
2 Preliminary and Related Work
We first introduce studies on graph convolution networks and the graph autoencoder. Generative graph representation models in network embedding are then reviewed briefly because of their connection to our model. Owing to space limitations, classical clustering models (e.g., spectral clustering) are omitted.
2.1 Notations
In this paper, matrices and vectors are represented by uppercase and lowercase letters, respectively. For a square matrix , is the trace and denotes the transpose of . A graph is represented as and is the size of some set. The vector whose elements are all 1 is represented as 1. If , then ; otherwise, . Every node is represented by a dimension vector and thus can also be denoted by . The number of clusters is represented as .
2.2 Graph AutoEncoder
In recent years, graph convolution networks (GCN) have been studied extensively to extend neural networks to graph-type data. How to design the graph convolution operator is a key issue and has attracted a great deal of attention. Most operators can be classified into two categories: spectral methods (Bruna et al., 2013) and spatial methods (Niepert et al., 2016). In this paper, we focus on a simple but widely used convolution operator (Kipf & Welling, 2017), which can be regarded as both a spectral operator and a spatial operator. Formally, if the input of a graph convolution layer is and the adjacency matrix is , then the output is defined as
(1) 
where is some activation function, , and denotes the degree matrix (). It should be pointed out that can be regarded as a processed graph with a self-loop for each node and is the normalized adjacency matrix. More importantly, from the spatial perspective, is equivalent to computing a weighted mean for each node over its first-order neighbors. To improve performance, MixHop (Abu-El-Haija et al., 2019) mixes information from neighbors of different orders and SGC (Wu et al., 2019) utilizes higher-order neighbors. The representational capacity has also been analyzed to some extent (Xu et al., 2019). GCN and its variants are usually used for semi-supervised learning. Besides, since training each GCN layer needs all data to complete a propagation, several models have been proposed to speed it up (Chen et al., 2018; Chiang et al., 2019).
To apply graph convolution to unsupervised learning, the graph autoencoder (GAE) was proposed (Kipf & Welling, 2016). GAE first transforms each node into a latent representation (also called an embedding), similarly to GCN, and then aims to reconstruct some part of the input. The GAEs proposed in (Kipf & Welling, 2016; Pan et al., 2018; Wang et al., 2019) reconstruct the adjacency via the decoder, while the GAEs developed in (Wang et al., 2017; Park et al., 2019) reconstruct the content. They differ in which extra mechanism (such as attention, adversarial learning, or graph sharpness) is used.
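As a concrete illustration of the convolution operator of Kipf & Welling (2017) discussed in this subsection, the following minimal sketch applies one propagation step in plain Python. It assumes the commonly stated form of that layer: add a unit self-loop to the adjacency matrix, normalize symmetrically by the degrees of the self-looped graph, and multiply by the features and a weight matrix before the activation. The names `gcn_layer` and `matmul`, the ReLU default, and the toy inputs are illustrative assumptions and do not come from the original text.

```python
import math


def matmul(P, Q):
    """Dense matrix product of two lists of lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def gcn_layer(A, X, W, activation=lambda v: max(v, 0.0)):
    """One graph convolution: activation(A_hat X W) with a self-looped,
    symmetrically normalized adjacency A_hat (assumed standard form)."""
    n = len(A)
    # Add a unit self-loop to every node.
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Degrees of the self-looped graph.
    deg = [sum(row) for row in A_tilde]
    # Symmetric normalization: D^{-1/2} A_tilde D^{-1/2}.
    A_hat = [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    out = matmul(matmul(A_hat, X), W)
    return [[activation(v) for v in row] for row in out]


# Two connected nodes with scalar features 1 and 3 and an identity weight.
A = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0], [3.0]]
W = [[1.0]]
print(gcn_layer(A, X, W))
```

In this toy graph every normalized weight equals 0.5, so each node receives the mean of its own feature and its neighbor's, which matches the "weighted mean over first-order neighbors" reading given above.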
2.3 Generative Graph Embedding
Graph representation learning transforms the nodes of a graph into vector representations that can be used for node classification, link prediction, and so on. Similar to generative models in classical supervised learning, the core assumption of generative models in network embedding is that there exists an underlying connectivity distribution and that all edges in the graph are sampled from this true distribution. Therefore, a generative model approximates the underlying distribution via latent variables, i.e., the learned embedding. In recent years, several deep generative models have been developed, such as DeepWalk (Perozzi et al., 2014), Node2vec (Grover & Leskovec, 2016) and GraphGAN (Wang et al., 2018).
3 Proposed Model
In this section, we present the proposed model, Adaptive Graph AutoEncoder (AdaGAE), for general data clustering. The core idea is illustrated in Figure 1.
3.1 Probabilistic Perspective of Weighted Graph
Let denote the connectivity distribution of node . To satisfy the basic property of a probability distribution, we have .
In a general clustering scenario, edges frequently do not exist, so edges and weights need to be constructed via some scheme. Since , the distribution can be viewed as valid weights. Note that does not have to hold, and therefore the constructed graph should be viewed as a directed graph. From this probabilistic perspective, given distances among samples , we expect that
(2) 
so that the constructed graph is locally coherent. However, it is impractical to solve the above problem directly, as it has a trivial solution: and if . A common remedy is Regularized Loss Minimization, with which the objective can be stated as
(3) 
where is some regularization term.
In most practical situations, although global distances are usually unreliable, local distances are regarded as vital in manifold learning. Accordingly, if the data are modeled as a graph and the Euclidean distance is used to measure similarity among data points, an ideal distribution should be sparse. More formally, let and the sparse distribution should satisfy that where represents a small constant. Hence, the regularization term should be . Nevertheless, the norm is non-convex and the resulting problem is NP-hard. Generally, one solves a convex relaxation, i.e., , since the norm is the tightest convex relaxation of the norm and it guarantees sparseness of the solution. However, this non-differentiable problem is hard to optimize and the degree of sparsity cannot be controlled. Instead, we will show that the norm relaxation provides steerable sparsity and admits an analytic solution. To control the sparsity of the distribution for each node, we utilize pointwise regularization.
Theorem 1.
The norm relaxation of problem (3)
(4) 
has a sparse solution if satisfies
(5) 
where denotes the th smallest value of .
The pointwise regularization can control the sparsity of each node but also increases the number of hyperparameters. In this paper, we simply choose a unified sparsity for all nodes. Formally, is set as the upper bound
(6) 
It should be emphasized that there is only one hyperparameter, the sparsity , in our model, which is much easier to tune than the one in the traditional relaxation method. In particular, problem (4) can be solved analytically. The derivation of the solution to problem (4) and the proof of Theorem 1 are given in Section 4.1 and Section 4.2, respectively.
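The role of the sparsity hyperparameter can be checked numerically. The sketch below assumes that the regularizer in problem (4) is the squared Euclidean norm of the distribution weighted by a coefficient `gamma`, so that each per-node subproblem becomes a Euclidean projection of `-d / (2 * gamma)` onto the probability simplex. The sort-based projection, the function names, and the toy distances are illustrative choices, and the formula in `gamma_upper_bound` is derived under that assumption rather than copied from Eq. (6).

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    running = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        running += ui
        candidate = (running - 1.0) / i
        # The last index for which this holds fixes the threshold.
        if ui - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def regularized_distribution(d, gamma):
    """Minimize sum_j p_j d_j + gamma * sum_j p_j^2 over the simplex."""
    return project_to_simplex([-x / (2.0 * gamma) for x in d])


def gamma_upper_bound(d, k):
    """Largest gamma that keeps exactly k non-zero weights (assumed form)."""
    s = sorted(d)
    return 0.5 * (k * s[k] - sum(s[:k]))


def support_size(p, tol=1e-9):
    """Number of weights that are non-zero up to a numerical tolerance."""
    return sum(1 for x in p if x > tol)


d = [1.0, 2.0, 4.0, 7.0, 11.0]  # toy distances from one node to the others
for k in (1, 2, 3):
    p = regularized_distribution(d, gamma_upper_bound(d, k))
    print(k, support_size(p), [round(x, 3) for x in p])
```

Under these assumptions the number of non-zero weights equals the requested `k` for each setting, which is the sense in which a single integer replaces a continuous regularization coefficient.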
3.2 Graph AutoEncoder for Weighted Graph
After obtaining the connectivity distribution by solving problem (4), we transform the directed graph into an undirected graph via , and the connectivity distribution serves as the reconstruction target of the graph autoencoder, as elaborated below.
Encoder
As shown in (Kipf & Welling, 2017), graphs with self-loops show better performance, i.e., . Due to , if . Moreover, the weight of the self-loop is learned adaptively rather than primitive . Consequently, we can simply set and . The encoder consists of multiple GCN layers and transforms the raw features into latent features using the constructed graph structure. Specifically, the latent feature is defined as
(7) 
Decoder
Instead of reconstructing the weight matrix , we aim to recover the connectivity distribution . Firstly, distances of latent features are calculated by . Secondly, the connectivity distribution is reconstructed by a normalization step
(8) 
The above process can be regarded as inputting into a softmax layer. Clearly, the smaller is, the larger is. In other words, similarity is measured by Euclidean distances rather than the inner products used in GAE. To measure the difference between two distributions, the Kullback-Leibler (KL) divergence is conventionally utilized. Consequently, the objective function is defined as
(9) 
Note that the second line is aiming to minimize the cross entropy, which is widely employed in classification tasks.
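The decoder and its loss can be sketched in a few lines. This is only an illustration: it assumes that the latent distances are squared Euclidean distances, that the softmax in each row runs over all nodes including the node itself, and that the loss is the sum over nodes of the KL divergence from the target distribution to the reconstructed one. The function names and the toy embedding are hypothetical.

```python
import math


def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of the embedding Z."""
    return [[sum((a - b) ** 2 for a, b in zip(zi, zj)) for zj in Z]
            for zi in Z]


def decode(Z):
    """Row-wise softmax of negative latent distances (assumed decoder form)."""
    Q = []
    for row in pairwise_sq_dists(Z):
        m = min(row)  # shift for numerical stability; cancels after normalizing
        e = [math.exp(-(x - m)) for x in row]
        s = sum(e)
        Q.append([v / s for v in e])
    return Q


def kl_divergence(P, Q, eps=1e-12):
    """Sum over nodes of KL(p_i || q_i), using the convention 0 * log 0 = 0."""
    total = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0:
                total += p * math.log(p / max(q, eps))
    return total


Z = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]  # toy 2-D embedding of three nodes
Q = decode(Z)
P = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]  # toy sparse target
print([round(q, 4) for q in Q[0]], round(kl_divergence(P, Q), 4))
```

Because the target distribution is fixed during a training step, minimizing this KL divergence with respect to the embedding is the same as minimizing the cross entropy between the target and the reconstruction, which is the point made in the note on the second line of the objective.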
Local Structure Preserving
A primary drawback of the autoencoder is that, owing to the powerful representation capacity of neural networks, there may exist diverse latent representations that can all be decoded to the input of the encoder. Some of these representations may be useless or even harmful. To address this, a popular approach is to introduce prior information, as in the adversarial autoencoder (Makhzani et al., 2015) and the variational autoencoder (Kingma & Welling, 2014). Since similarities are measured by distances and local information is often credible (especially in manifold learning), we add a local structure preserving penalty term to Eq. (9), and thus the cost function is defined as
(10) 
where and is a trade-off parameter that balances the cross entropy term and the local consistency penalty term.
3.3 Adaptive Graph AutoEncoder
In the last subsection, the weighted adjacency matrix is regarded as fixed during the training phase. However, the weighted adjacency matrix is computed through Eq. (4). The whole clustering process should include connectivity learning, and hence the weighted adjacency should be updated adaptively during training. An intuitive approach is to recompute the connectivity distribution based on the embedding , which contains potential manifold structure information of the data. However, the following theorem shows that a simple update based on the latent representations may lead to collapse.
Theorem 2.
Denote where is generated by GAE with sparsity . If approximates well (numerically) then the solution of
(11) 
, with the same sparsity, degenerates into an unweighted adjacency matrix.
Intuitively, the unweighted graph is indeed a bad choice for classical clustering tasks. Therefore, the update step with the same sparsity coefficient may result in collapse. To address this problem, we assume that
Assumption 1.
Suppose that the sparse and weighted adjacency is good enough. Specifically, the weight of an edge is large if it lies within a cluster, and small otherwise. Then, under the latent representation, samples belonging to the same cluster become more cohesive as measured by the Euclidean distance.
According to the above assumption, samples from a cluster are more likely to lie in a local area after the GAE mapping. Hence, the sparsity coefficient is increased each time the weights are updated. The step size by which increases needs to be discussed. In an ideal situation, we can define the upper bound of as
(12) 
where denotes the th cluster. Although is not known, we can define empirically to ensure . For instance, can be set as or . Accordingly, the step size where is the number of iterations used to update the weighted adjacency.
To sum up, Algorithm 1 summarizes the whole process of Adaptive Graph AutoEncoder (AdaGAE).
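The adaptive part of the procedure can be sketched as a schedule for the sparsity. The following is only an illustration under the assumption that the sparsity grows linearly from its initial value to the chosen upper bound over the graph-update steps; flooring to an integer, the function name `sparsity_schedule`, and the example values are assumptions, and the GAE training and graph reconstruction performed inside each step are omitted.

```python
import math


def sparsity_schedule(k0, k_max, num_updates):
    """Sparsity used at each graph update, growing linearly from k0 to k_max.

    The step size is (k_max - k0) / num_updates; values are floored because
    the number of neighbors must be an integer (an assumption of this sketch).
    """
    step = (k_max - k0) / num_updates
    return [math.floor(k0 + i * step) for i in range(num_updates + 1)]


# Hypothetical setting: start at 5 neighbors, end at 50, with 10 graph updates.
for update, k in enumerate(sparsity_schedule(5, 50, 10)):
    # In the full procedure, the GAE would be trained here on the current
    # graph, and the graph would then be rebuilt from the new embedding
    # using sparsity k.
    print(update, k)
```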
4 Optimization and Theoretical Analysis
In this section, we first show how to solve problem (4) analytically. Then proofs of the two mentioned theorems are elaborated respectively.
4.1 Optimization of Problem (4)
To keep notations uncluttered, is simplified as . Then problem (4) is equivalent to solving the following subproblem for each node individually
(13) 
To keep the discussion concise, the subscript is omitted. Since is constant, we have
(14) 
Then the Lagrangian of the above equation is
(15) 
where and are Lagrangian multipliers. According to KKT conditions,
(16) 
It is not hard to verify that
(17) 
where . Without loss of generality, suppose that . According to Theorem 1, , or equivalently, where . Since , we have
(18) 
Substitute Eq. (6) into Eq. (18) and we have
(19) 
If , then it is not hard to verify that Eq. (19) is also the optimum. Accordingly, the connectivity distribution can be calculated via a closed-form solution.
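The closed-form solution can be written as a short routine. The sketch below assumes the form that follows from a squared Euclidean norm regularizer with the coefficient set to its upper bound: the `k` nearest neighbors receive weights proportional to the gap between the (k+1)-th smallest distance and their own distance, and all other weights are zero. The function name and the example distances are illustrative, and the routine assumes `k` is smaller than the number of candidates and that the `k` nearest distances are not all tied with the (k+1)-th.

```python
def closed_form_distribution(d, k):
    """Sparse connectivity distribution for one node (assumed closed form).

    d: distances from the node to every candidate neighbor.
    k: number of neighbors that receive non-zero weight.
    """
    order = sorted(range(len(d)), key=lambda j: d[j])
    d_next = d[order[k]]  # (k+1)-th smallest distance
    denom = sum(d_next - d[j] for j in order[:k])
    p = [0.0] * len(d)
    for j in order[:k]:
        # Closer neighbors get larger weights; the weights sum to one.
        p[j] = (d_next - d[j]) / denom
    return p


d = [4.0, 1.0, 7.0, 2.0]
print(closed_form_distribution(d, 2))
```

No iterative solver is needed: the cost per node is dominated by sorting the distances.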
4.2 Proof of Theorem 1
To keep notations uncluttered, we make the same assumption, .
Proof.
To begin with, the auxiliary function
(20) 
is nondecreasing with . Note that is strictly increasing when . If , there must exist that satisfies . In this case, is sparse. According to Eq. (18), we have
(21) 
When satisfies Eq. (5), we have
(22) 
according to the nondecreasing property of the auxiliary function . If , then
(23) 
Hence, which leads to contradiction. If
(24) 
In the first inequality, the equality will never hold due to . Accordingly, is at least sparse, which leads to a contradiction as well. Therefore, we have .
When , it is not hard to verify that which also leads to a contradiction. Finally, will never hold due to the constraint .
In sum, the theorem is proved. ∎
4.3 Analysis of Degeneration
In this part, we first prove Theorem 2 and then explain the phenomenon from a different perspective. We begin with the proof of Theorem 2.
Proof.
Consider the connectivity distribution of and suppose that . When approximates 0 numerically, we have
(25) 
If we update by Eq. (19) based on , then for any ,
(26) 
due to . The proof is easy to extend to other nodes. Hence, the theorem is proved. ∎
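The degeneration can be reproduced with a small numerical example. The sketch assumes the closed-form update described in Section 4.1 (weights proportional to the gap to the (k+1)-th smallest distance) and uses hypothetical latent distances in which the current neighbors are very close while all other nodes are far away, which is what a near-perfect reconstruction of a sparse target implies under a softmax decoder.

```python
def closed_form_distribution(d, k):
    """Assumed closed-form update: weights proportional to d^(k+1) - d_j."""
    order = sorted(range(len(d)), key=lambda j: d[j])
    d_next = d[order[k]]
    denom = sum(d_next - d[j] for j in order[:k])
    p = [0.0] * len(d)
    for j in order[:k]:
        p[j] = (d_next - d[j]) / denom
    return p


raw = [1.0, 2.0, 4.0, 7.0, 11.0]      # hypothetical raw-feature distances
latent = [0.1, 0.5, 1.0, 50.0, 60.0]  # hypothetical distances after a good fit

print([round(w, 3) for w in closed_form_distribution(raw, 3)])
print([round(w, 3) for w in closed_form_distribution(latent, 3)])
print([round(w, 3) for w in closed_form_distribution(latent, 4)])
```

With the same sparsity, the three latent weights are all close to 1/3, so the graph is nearly unweighted. Raising the sparsity to 4 brings in a more distant node with a visibly smaller weight than the three close neighbors, which is the intuition behind increasing the sparsity at each update.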
On the other hand, the following theorem demonstrates that the softmax output layer with is equivalent to solving problem (3) with a totally different regularization. Therefore, a perfect approximation may lead to poor performance.
Theorem 3.
The decoder of AdaGAE is equivalent to solving the following problem
(27) 
where represents the entropy of connectivity distribution of node .
The proof is stated in supplementary material.
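The statement of Theorem 3 can be checked numerically on a single node. The sketch assumes that the regularizer is the negative entropy with unit weight, i.e. the per-node objective is the expected distance minus the entropy of the distribution, and that distances enter the softmax with a minus sign. Function names and the toy distances are illustrative.

```python
import math


def softmax_of_negative(d):
    """Distribution proportional to exp(-d_j)."""
    m = min(d)  # shift for numerical stability; cancels after normalizing
    e = [math.exp(-(x - m)) for x in d]
    s = sum(e)
    return [v / s for v in e]


def entropy_regularized_objective(p, d):
    """sum_j p_j d_j - H(p), where H(p) = -sum_j p_j log p_j."""
    cost = sum(pj * dj for pj, dj in zip(p, d))
    neg_entropy = sum(pj * math.log(pj) for pj in p if pj > 0)
    return cost + neg_entropy


d = [0.5, 1.0, 2.0]
q = softmax_of_negative(d)
candidates = [[1 / 3, 1 / 3, 1 / 3], [0.8, 0.15, 0.05], [1.0, 0.0, 0.0]]
print(round(entropy_regularized_objective(q, d), 6))
print([round(entropy_regularized_objective(p, d), 6) for p in candidates])
```

The softmax distribution attains a lower objective value than each alternative, and its value equals the negative log of the normalizing constant, as expected for an entropy-regularized linear objective on the simplex. Note that this minimizer is dense, in contrast to the sparse target distribution.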
4.4 Spectral Analysis
As mentioned in Section 3.2, AdaGAE generates a weighted graph with adaptive self-loops. Analogous to SGC (Wu et al., 2019), adaptive self-loops also reduce the spectrum of the normalized Laplacian. Formally,
Theorem 4.
Let and . According to eigenvalue decomposition, suppose and . The following inequality always holds
(28) 
The proof is stated in the supplementary material. This theorem indicates that the adaptive self-loops smooth the Laplacian matrix as well.
Table 1: Clustering accuracy (ACC, %).
Methods                  UMIST  JAFFE  ORL    PALM   COIL20  YALE   USPS
SC                       29.01  70.14  56.43  19.54  56.42   29.58  28.75
K-Means                  42.87  72.39  54.75  70.39  58.26   41.33  64.67
FKSC                     50.64  79.58  69.33  75.30  69.78   50.96  -
CAN                      69.62  96.71  68.00  88.10  84.10   49.09  67.96
DFKM                     45.47  90.83  61.43  67.45  60.21   51.43  73.42
DEC                      36.47  62.95  27.45  74.35  50.42   42.30  71.22
GAE (fixed )             73.22  96.71  71.75  88.30  92.43   57.58  67.48
AdaGAE (fixed sparsity)  32.00  47.42  68.00  91.80  33.82   57.58  34.09
AdaGAE                   83.48  97.27  71.40  95.25  93.75   57.58  91.96
Table 2: Normalized mutual information (NMI, %).
Methods                  UMIST  JAFFE  ORL    PALM   COIL20  YALE   USPS
SC                       30.77  77.94  75.69  37.25  71.06   34.78  21.54
K-Means                  65.47  80.90  75.43  89.98  74.58   48.71  62.88
FKSC                     67.67  84.24  83.32  93.28  80.75   50.38  -
CAN                      87.75  96.39  83.58  97.08  90.93   56.18  78.85
DFKM                     67.04  92.01  81.13  86.74  76.81   55.92  71.58
DEC                      56.96  82.83  55.22  90.37  69.43   48.71  72.25
GAE (fixed )             87.04  96.39  84.51  96.96  97.26   76.93  76.45
AdaGAE (fixed sparsity)  52.08  59.55  83.65  97.80  55.46   76.93  35.39
AdaGAE                   91.03  96.78  85.34  98.18  98.36   76.93  84.81
Figure 2: t-SNE visualization on UMIST and USPS. The first row shows results on UMIST and the second row shows results on USPS. Clearly, AdaGAE projects similar samples close together in the embedding. Note that a few data points are projected into the wrong group; these are usually regarded as outliers.
5 Experiments
In this section, we describe the experimental details of AdaGAE and report the results. The visualization supports the theoretical analysis of the last section.
5.1 Datasets
AdaGAE is evaluated on seven datasets: UMIST (Hou et al., 2013), JAFFE (Lyons et al., 1999), ORL (Cai et al., 2010), PALM, COIL20 (Nene et al., 1996), YALE (Georghiades et al., 2001) and USPS (Hull, 1994). Their details are summarized in Table 3. All features are rescaled to . Since applying GCN to large graphs is still a challenging problem and is not the focus of this paper, we only conduct experiments on small- and medium-scale datasets. For larger datasets, a feasible approach is to replace the encoder with a GCN designed for large-scale data, such as StoGCN (Chen et al., 2018) or Cluster-GCN (Chiang et al., 2019).
Table 3: Information of datasets.
Name    # Features  # Size  # Classes
UMIST   1024        575     20
JAFFE   1024        213     10
ORL     1024        400     40
PALM    256         2000    100
COIL20  1024        1440    20
YALE    1024        165     15
USPS    256         9298    10
5.2 Compared Methods
To evaluate AdaGAE, six methods are compared: Spectral Clustering (SC) (Ng et al., 2002), K-Means, FKSC (Zhang et al., 2018), CAN (Nie et al., 2014), DFKM (Zhang et al., 2019) and DEC (Xie et al., 2016). Roughly speaking, SC, K-Means, FKSC and CAN are traditional clustering approaches, while DFKM and DEC are deep clustering models with different embedded clustering models. The hyperparameters of these methods are searched in the same way as described in the corresponding papers. The code for these methods was downloaded from the authors' homepages.
5.3 Experimental Setup
In our experiments, the encoder consists of two GCN layers. If the input dimension is 1024, the first layer has 256 neurons and the second layer has 64 neurons. Otherwise, the two layers have 128 neurons and 64 neurons, respectively. The activation function of the first layer is ReLU, while the second layer uses a linear function. The initial sparsity is set as 5 and the upper bound is searched from . The trade-off coefficient is searched from . The number of graph update steps is set as 10 and the maximum number of iterations to optimize GAE varies in .
To verify the effect of the adaptive process, two extra experiments are conducted: GAE with fixed and AdaGAE with fixed sparsity . Note that all hyperparameters are the same except for the specific setting.
Two popular clustering metrics, accuracy (ACC) and normalized mutual information (NMI), are employed to evaluate performance. All methods are run 10 times and the means are reported. The code is implemented with PyTorch 1.3.1 on a Windows 10 PC with an NVIDIA GeForce GTX 1660 GPU and 8 i7 cores. The exact values of the hyperparameters can be found in the supplementary material.
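For readers unfamiliar with the ACC metric, the sketch below shows the usual definition: predicted cluster labels are arbitrary, so accuracy is computed under the best one-to-one mapping between predicted clusters and ground-truth classes. The text does not state how ACC was computed in the experiments, so this is an assumption about the common convention. The brute-force search over permutations is only practical for a handful of clusters; evaluations at the scale of the datasets above would normally use the Hungarian algorithm instead. The function name is illustrative.

```python
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    true_ids = sorted(set(y_true))
    pred_ids = sorted(set(y_pred))
    size = max(len(true_ids), len(pred_ids))
    # Pad with dummy targets so that every predicted cluster can be assigned.
    targets = true_ids + [None] * (size - len(true_ids))
    # counts[(p, t)] = number of samples in predicted cluster p with class t.
    counts = {}
    for t, p in zip(y_true, y_pred):
        counts[(p, t)] = counts.get((p, t), 0) + 1
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        matched = sum(counts.get((p, t), 0) for p, t in zip(pred_ids, perm))
        best = max(best, matched)
    return best / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
print(clustering_accuracy(y_true, [1, 1, 0, 0, 2, 2]))  # labels permuted only
print(clustering_accuracy(y_true, [1, 1, 0, 2, 2, 2]))  # one sample misplaced
```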
5.4 Experimental Results
ACC and NMI of all mentioned methods are summarized in Table 1 and Table 2, respectively. The best results of both competitors and AdaGAEs are highlighted in boldface. From Table 1 and Table 2, we conclude that:

Classical deep clustering models suffer from overfitting and work poorly on small-scale datasets, while AdaGAE is stable on all datasets. Moreover, the improvement brought by graph convolution is impressive. Specifically, AdaGAE outperforms DFKM by about 20% on ACC and 7% on NMI for USPS.

When the sparsity is kept fixed, AdaGAE collapses on UMIST, JAFFE, COIL20 and USPS. For example, ACC shrinks about 50% and NMI shrinks about on COIL20.

From the comparison of two extra experiments, we confirm that the adaptive graph update process plays a positive role on most datasets except for ORL, which may be caused by too few samples in each cluster such that it is hard to define and .
Besides, Figure 2 illustrates the learned embedding vividly. Combined with Theorem 2, if is fixed as a constant, then degenerates into an unweighted adjacency matrix and a cluster is broken into a mass of groups. Each group contains only a small number of data points, and the groups scatter chaotically, which leads to collapse. In contrast, the adaptive process introduced in Section 3.3 connects these groups before degeneration by increasing the sparsity, and hence the embeddings within a cluster become quite cohesive. It should be emphasized that a large frequently captures wrong information. After the GAE transformation, the nearest neighbors are more likely to belong to the same cluster, and thus it is reasonable to increase with an adequate step size.
6 Conclusion
In this paper, we propose a novel clustering model for general data clustering, namely Adaptive Graph AutoEncoder (AdaGAE). A generative graph representation model is utilized to construct a weighted graph with steerable sparsity. To exploit the potential information of the data, we employ a graph convolution operator and design a graph autoencoder with local structure preservation. More importantly, since the graph used in GAE is constructed artificially, an adaptive update step is developed to update the graph with the help of the learned embedding. The related theoretical analysis explains why AdaGAE with fixed sparsity collapses in the update step. In experiments, we verify the effectiveness of the adaptive update step by removing the corresponding part of AdaGAE. The visualization supports the theoretical analysis well and confirms the necessity of the adaptive graph update.
References
 Abu-El-Haija et al. (2019) Abu-El-Haija, S., Perozzi, B., Kapoor, A., Alipourfard, N., Lerman, K., Harutyunyan, H., Ver Steeg, G., and Galstyan, A. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pp. 21–29, 2019.
 Bruna et al. (2013) Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

 Cai et al. (2010) Cai, D., Zhang, C., and He, X. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 333–342, 2010.
 Cao et al. (2016) Cao, S., Lu, W., and Xu, Q. Deep neural networks for learning graph representations. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 Chen et al. (2018) Chen, J., Zhu, J., and Song, L. Stochastic training of graph convolutional networks with variance reduction. In International Conference on Machine Learning, pp. 942–950, 2018.
 Chiang et al. (2019) Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C.-J. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257–266, 2019.
 Georghiades et al. (2001) Georghiades, A., Belhumeur, P., and Kriegman, D. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):643–660, 2001.
 Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864, 2016.
 Hinton & Salakhutdinov (2006) Hinton, G. E. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 Hou et al. (2013) Hou, C., Nie, F., Li, X., Yi, D., and Wu, Y. Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Transactions on Cybernetics, 44(6):793–804, 2013.
 Hull (1994) Hull, J. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
 Kipf & Welling (2016) Kipf, T. N. and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
 Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
 Lyons et al. (1999) Lyons, M., Budynek, J., and Akamatsu, S. Automatic classification of single facial images. IEEE transactions on pattern analysis and machine intelligence, 21(12):1357–1362, 1999.
 Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 Nene et al. (1996) Nene, S., Nayar, S., and Murase, H. Columbia object image library (COIL-20). 1996.
 Ng et al. (2002) Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pp. 849–856, 2002.
 Nie et al. (2014) Nie, F., Wang, X., and Huang, H. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 977–986. ACM, 2014.
 Niepert et al. (2016) Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023, 2016.
 Pan et al. (2018) Pan, S., Hu, R., Long, G., Jiang, J., Yao, L., and Zhang, C. Adversarially regularized graph autoencoder for graph embedding. In IJCAI, pp. 2609–2615, 2018.

 Park et al. (2019) Park, J., Lee, M., Chang, H. J., Lee, K., and Choi, J. Y. Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6519–6528, 2019.
 Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.
 Scarselli et al. (2008) Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
 Shaham et al. (2018) Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., and Kluger, Y. Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.
 Wang et al. (2017) Wang, C., Pan, S., Long, G., Zhu, X., and Jiang, J. Mgae: Marginalized graph autoencoder for graph clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 889–898, 2017.
 Wang et al. (2019) Wang, C., Pan, S., Hu, R., Long, G., Jiang, J., and Zhang, C. Attributed graph clustering: a deep attentional embedding approach. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3670–3676. AAAI Press, 2019.
 Wang et al. (2016) Wang, D., Cui, P., and Zhu, W. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234, 2016.
 Wang et al. (2018) Wang, H., Wang, J., Wang, J., Zhao, M., Zhang, W., Zhang, F., Xie, X., and Guo, M. Graphgan: Graph representation learning with generative adversarial nets. In ThirtySecond AAAI Conference on Artificial Intelligence, pp. 2508–2515, 2018.
 Wu et al. (2019) Wu, F., Zhang, T., Holanda de Souza, A., Fifty, C., Yu, T., and Weinberger, K. Q. Simplifying graph convolutional networks. Proceedings of Machine Learning Research, 2019.
 Xie et al. (2016) Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487, 2016.
 Xu et al. (2019) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In Proc. of ICLR, 2019.
 Zhang et al. (2018) Zhang, R., Nie, F., Guo, M., Wei, X., and Li, X. Joint learning of fuzzy k-means and nonnegative spectral clustering with side information. IEEE Transactions on Image Processing, 28(5):2152–2162, 2018.
 Zhang et al. (2019) Zhang, R., Li, X., Zhang, H., and Nie, F. Deep fuzzy k-means with adaptive loss and entropy regularization. IEEE Transactions on Fuzzy Systems, pp. 1–1, 2019. ISSN 1941-0034. doi: 10.1109/TFUZZ.2019.2945232.
Appendix A Proof of Theorem 3
Proof.
Problem (27) is equivalent to the following th subproblem
(29) 
Similarly, the subscript is omitted to keep notations uncluttered. The Lagrangian is
(30) 
Then the KKT conditions are
(31) 
Due to , . Using the first line, we have
(32) 
Combine it with the second line and we have
(33) 
Furthermore, we have
(34) 
With , the theorem is proved. ∎
Appendix B Proof of Theorem 4
It should be pointed out that the proof imitates the corresponding proof in (Wu et al., 2019). Analogous to Lemma 3 in (Wu et al., 2019), we first give the following lemma without proof,
Lemma 1.
Let be eigenvalues of and be eigenvalues of . The following inequality always holds
(35) 
The proof is apparent according to Lemma 3 provided in (Wu et al., 2019). The proof is given as follows
Proof.
Let and we have
∎
Dataset  

UMIST  1  5  50  10  
COIL20  1  5  100  10  
JAFFE  5  20  10  
PALM  10  50  10  
YALE  5  150  10  
ORL  5  150  10  
USPS  5  150  10  
Appendix C Experimental Details
The exact hyperparameter values of AdaGAE used in our experiments are reported in Table 4. To use SC, we construct the graph via a Gaussian kernel, which is given as
(36) 
where represents nearest neighbors of sample . is searched from and is searched from .
The maximum iterations of GAE with fixed is set as 200. Note that its objective function is defined as Eq. (10).
The code for K-Means, SC, FKSC, and CAN is implemented in MATLAB 2019a, while the code for DFKM, DEC and AdaGAE is implemented in Python 3.6.
For all datasets, we simply rescale features into . All datasets are downloaded from http://www.escience.cn/people/fpnie/index.html.