Adaptive Graph Auto-Encoder for General Data Clustering

02/20/2020 · Xuelong Li, et al.

Graph based clustering plays an important role in the clustering field. Recent studies on graph convolution neural networks have achieved impressive success on graph type data. However, in traditional clustering tasks, the graph structure of the data does not exist, so the strategy used to construct the graph is crucial for performance. In addition, existing graph auto-encoder based approaches perform poorly on weighted graphs, which are widely used in graph based clustering. In this paper, we propose a graph auto-encoder with local structure preserving for general data clustering, which can update the constructed graph adaptively. The adaptive process is designed to exploit the non-Euclidean structure sufficiently. By combining the generative model for graph embedding with graph based clustering, a graph auto-encoder with a novel decoder is developed, and it performs well in scenarios where weighted graphs are used. Extensive experiments demonstrate the superiority of our model.


1 Introduction

Clustering, which intends to group data points without any prior information, is one of the most fundamental tasks in machine learning. Alongside the well-known k-means, graph based clustering (Ng et al., 2002; Nie et al., 2014; Zhang et al., 2018) is another representative family of clustering methods. A typical graph based clustering method consists of two steps: 1) construct a graph by some algorithm; 2) divide the samples into different clusters according to the constructed graph. For example, a classical spectral clustering method first builds a weighted adjacency matrix via k-nearest neighbors and a Gaussian kernel, and then attempts to find a good graph cut that splits the graph into multiple connected components. Accordingly, it is not hard to see that the performance of these clustering methods depends heavily on the quality of the constructed graph. Graph based clustering methods are widely used since they can capture manifold information and are therefore applicable to non-Euclidean data, which k-means cannot handle. Due to the success of deep learning, how to combine neural networks with traditional clustering models has been studied extensively (Shaham et al., 2018; Xie et al., 2016; Zhang et al., 2019). In particular, the auto-encoder (Hinton & Salakhutdinov, 2006) is the base framework of most deep clustering approaches.

Network embedding (also known as graph embedding) is an attractive task for graph type data arising in recommendation systems, social networks, etc. The goal is to map the nodes of a given graph into latent features (namely embeddings) so that the learned embeddings can be utilized for node classification, node clustering and link prediction. Roughly speaking, network embedding approaches can be classified into two categories: generative models (Wang et al., 2018; Perozzi et al., 2014; Grover & Leskovec, 2016) and discriminative models (Cao et al., 2016; Wang et al., 2016). The former try to model a connectivity distribution for each node, while the latter learn to distinguish directly whether an edge exists between two nodes. Nevertheless, there is no method that aims to integrate generative models with clustering.

In recent years, graph neural networks (GNNs) (Scarselli et al., 2008), especially graph convolution neural networks (GCNs), have attracted a great deal of attention due to the success of neural networks. GNNs extend classical neural networks to irregular data so that the deep information hidden in a graph can be exploited sufficiently. In this paper, we only focus on GCNs and their variants. Different from traditional CNNs, a key issue in GCNs is how to define the convolution operator on irregular data. An intuitive approach is to restrict the convolution to the neighbors of each node, which is frequently known as the spatial method (Niepert et al., 2016). On the contrary, the spectral method first defines the frequency domain of graph data and then performs convolution with the help of the convolution theorem. GCNs have shown superiority compared with traditional graph embedding models. Similarly, the graph auto-encoder (Kipf & Welling, 2016) was developed to extend GCNs to unsupervised learning.

However, the existing methods are limited to graph type data (e.g., social networks, citation networks, etc.). Since a large proportion of clustering methods are based on graphs, such as spectral clustering, it is reasonable to consider how to employ GCNs to promote the performance of graph based clustering methods.

Figure 1: Illustration of AdaGAE

In this paper, we propose the Adaptive Graph Auto-Encoder (AdaGAE), a novel model for general data clustering, which extends the graph auto-encoder to common scenarios. The main contributions are listed as follows:

  • To build a desirable graph, our model incorporates the generative model of network embedding. Moreover, the learned connectivity distribution also serves as the target that the graph auto-encoder aims to reconstruct.

  • Our model updates the graph adaptively according to the generated embedding, so that it can exploit the deep information and revise the poor graph constructed from raw features.

  • Our model also employs a manifold regularization to preserve the local structure, which can be regarded as pseudo-supervised information.

2 Preliminary and Related Work

We first introduce studies on graph convolution networks and graph auto-encoders. Generative graph representation models in network embedding are then briefly reviewed due to their connection with our model. Due to space limitations, classical clustering models (e.g., spectral clustering) are omitted.

2.1 Notations

In this paper, matrices and vectors are represented by uppercase and lowercase letters, respectively. For a square matrix $M$, $\mathrm{tr}(M)$ is the trace and $M^T$ denotes the transpose of $M$. A graph is represented as $\mathcal{G} = (V, E, W)$ and $|\cdot|$ denotes the size of a set. The vector whose elements are all 1 is represented as $\mathbf{1}$. If $(v_i, v_j) \in E$, then $w_{ij} > 0$; otherwise, $w_{ij} = 0$. Every node $v_i$ is represented by a $d$-dimensional vector $x_i$ and thus, $V$ can also be denoted by $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$. The number of clusters is represented as $c$.

2.2 Graph Auto-Encoder

In recent years, graph convolution networks (GCNs) have been studied extensively to extend neural networks to graph type data. How to design the graph convolution operator is a key issue and has attracted a great deal of attention. Most operators can be classified into two categories, spectral methods (Bruna et al., 2013) and spatial methods (Niepert et al., 2016). In this paper, we focus on a simple but widely used convolution operator (Kipf & Welling, 2017), which can be regarded as both a spectral and a spatial operator. Formally, if the input of a graph convolution layer is $H$ and the adjacency matrix is $A$, then the output is defined as

$$H' = \sigma\big(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H \Theta\big), \qquad (1)$$

where $\sigma$ is some activation function, $\hat{A} = A + I$, $\Theta$ is a learnable weight matrix, and $\hat{D}$ denotes the degree matrix ($\hat{D}_{ii} = \sum_j \hat{a}_{ij}$). It should be pointed out that $\hat{A}$ can be regarded as a processed graph with a self-loop for each node and $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$ is the normalized adjacency matrix. More importantly, from the spatial aspect, $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H$ amounts to computing a weighted mean for each node over its first-order neighbors. To improve the performance, MixHop (Abu-El-Haija et al., 2019) mixes information from neighbors of different orders and SGC (Wu et al., 2019) utilizes higher-order neighbors. The representational capacity of such operators has also been studied to some extent (Xu et al., 2019). GCNs and their variants are usually applied to semi-supervised learning. Besides, since training each GCN layer requires all data to finish a complete propagation, several models have been proposed to speed it up (Chen et al., 2018; Chiang et al., 2019).
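As a concrete illustration of the operator in Eq. (1), the layer can be written in a few lines of dense-matrix PyTorch. This is our own sketch under the reconstruction above, not the authors' released code, and it assumes a graph small enough for dense matrices.

```python
import torch

def gcn_layer(H, A, Theta, activation=torch.relu):
    """One graph convolution layer in the style of Kipf & Welling (2017):
    sigma(D^{-1/2} (A + I) D^{-1/2} H Theta)."""
    A_hat = A + torch.eye(A.shape[0])              # add self-loops
    deg = A_hat.sum(dim=1)                         # degrees of the self-looped graph
    D_inv_sqrt = torch.diag(deg.pow(-0.5))         # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # normalized adjacency
    return activation(A_norm @ H @ Theta)          # aggregate neighbors, then transform
```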

To apply graph convolution to unsupervised learning, the graph auto-encoder (GAE) was proposed (Kipf & Welling, 2016). A GAE first transforms each node into a latent representation (also called an embedding), in a way similar to GCN, and then aims to reconstruct some part of the input. The GAEs proposed in (Kipf & Welling, 2016; Pan et al., 2018; Wang et al., 2019) intend to reconstruct the adjacency via the decoder, while the GAEs developed in (Wang et al., 2017; Park et al., 2019) attempt to reconstruct the content. The difference lies in which extra mechanism (such as attention, adversarial learning, graph sharpening, etc.) is used.

2.3 Generative Graph Embedding

Graph representation learning transforms the nodes of a graph into vector representations so that they can be employed for node classification, link prediction and so on. Similar to generative models in classical supervised learning, the core assumption of generative models in network embedding is that there exists an underlying connectivity distribution and all edges in the graph are sampled from this true distribution. Therefore, the generative model intends to approximate the potential distribution via latent variables, i.e., the learned embeddings. In recent years, several deep generative models have been developed, such as DeepWalk (Perozzi et al., 2014), Node2vec (Grover & Leskovec, 2016) and GraphGAN (Wang et al., 2018).

3 Proposed Model

In this section, we present the proposed model, the Adaptive Graph Auto-Encoder (AdaGAE), for general data clustering. The core idea is illustrated in Figure 1.

3.1 Probabilistic Perspective of Weighted Graph

Let $p_i = [p_{i1}, p_{i2}, \ldots, p_{in}]^T$ denote the connectivity distribution of node $v_i$. To satisfy the basic property of a probability distribution, we have $\sum_j p_{ij} = 1$ and $p_{ij} \ge 0$.

In the general clustering scenario, edges frequently do not exist; edges and weights need to be constructed via some scheme. Since $p_{ij} \ge 0$ and $\sum_j p_{ij} = 1$, the distribution can be viewed as valid weights. Note that $p_{ij} = p_{ji}$ does not have to hold, and therefore, the constructed graph should be viewed as a directed graph. From this probabilistic perspective, given the distances among samples $d_{ij} = \|x_i - x_j\|_2^2$, we expect that

$$\min_{p_i^T \mathbf{1} = 1,\ p_i \ge 0} \sum_{j} p_{ij} d_{ij}, \qquad (2)$$

so that the constructed graph is locally coherent. However, it is impracticable to solve the above problem directly, as it has a trivial solution: $p_{ij} = 1$ if $j = \arg\min_l d_{il}$ and $p_{ij} = 0$ otherwise. A universal remedy is to employ Regularized Loss Minimization, and the objective can be stated as

$$\min_{p_i^T \mathbf{1} = 1,\ p_i \ge 0} \sum_{j} p_{ij} d_{ij} + \gamma_i R(p_i), \qquad (3)$$

where $R(\cdot)$ is some regularization term.

In most practical situations, although the global distance is usually unreliable, the local distance is regarded as a vital part in manifold learning. Similarly, if the data is modeled as a graph and the Euclidean distance is used as the measure of similarity among data points, an ideal distribution should be sparse. More formally, a sparse distribution should satisfy $\|p_i\|_0 \le k$, where $k$ represents a small constant. Hence, the regularization term should be $R(p_i) = \|p_i\|_0$. Nevertheless, the $\ell_0$-norm is non-convex and the problem is NP-hard to solve. Generally, one solves a convex relaxation instead, i.e., $R(p_i) = \|p_i\|_1$, since the $\ell_1$-norm is the tightest convex relaxation of the $\ell_0$-norm and it guarantees the sparseness of the solution. However, the resulting non-differentiable problem is hard to optimize and the degree of sparsity cannot be controlled. For this problem, we will show that the $\ell_2$-norm relaxation can provide steerable sparsity and admits an analytic solution. To control the sparsity of the distribution of each node, we utilize the point-wise regularization weight $\gamma_i$.

Theorem 1.

The $\ell_2$-norm relaxation of problem (3),

$$\min_{p_i^T \mathbf{1} = 1,\ p_i \ge 0} \sum_{j} p_{ij} d_{ij} + \gamma_i \|p_i\|_2^2, \qquad (4)$$

has a $k$-sparse solution $p_i$ if $\gamma_i$ satisfies

$$\frac{k}{2} d_{i,(k)} - \frac{1}{2} \sum_{l=1}^{k} d_{i,(l)} < \gamma_i \le \frac{k}{2} d_{i,(k+1)} - \frac{1}{2} \sum_{l=1}^{k} d_{i,(l)}, \qquad (5)$$

where $d_{i,(l)}$ denotes the $l$-th smallest value of $\{d_{ij}\}_{j=1}^{n}$.

The point-wise regularization can control the sparsity of each node but also increases the number of hyper-parameters. In this paper, we simply choose a unified sparsity $k$ for all nodes. Formally, $\gamma_i$ is set as the upper bound

$$\gamma_i = \frac{k}{2} d_{i,(k+1)} - \frac{1}{2} \sum_{l=1}^{k} d_{i,(l)}. \qquad (6)$$

It should be emphasized that there is only one hyper-parameter, the sparsity $k$, in our model, which is much easier to tune than the coefficient in the traditional relaxation method. In particular, problem (4) can be solved analytically. The concrete derivation for solving problem (4) and the proof of Theorem 1 are given in Section 4.1.
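For illustration, the construction can be sketched per node as below. This is our own sketch under the closed form of Eq. (19) derived later ($p_{ij} \propto (d_{i,(k+1)} - d_{ij})_+$), not the authors' released code, and it assumes squared Euclidean distances as input.

```python
import numpy as np

def sparse_connectivity(D, k):
    """Sketch of the k-sparse connectivity distribution, assuming the closed
    form p_ij = (d_{i,(k+1)} - d_ij)_+ / (k * d_{i,(k+1)} - sum_{l<=k} d_{i,(l)})
    of Eq. (19). D is an (n, n) matrix of squared Euclidean distances; the zero
    self-distance gives every node an adaptively weighted self-loop."""
    n = D.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D[i])                        # ascending distances
        d_top = D[i, idx[:k]]                         # k smallest distances
        d_k1 = D[i, idx[k]]                           # (k+1)-th smallest distance
        P[i, idx[:k]] = (d_k1 - d_top) / (k * d_k1 - d_top.sum() + 1e-10)
    return P                                          # each row sums to (almost exactly) 1
```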

3.2 Graph Auto-Encoder for Weighted Graph

After obtaining the connectivity distribution $P$ by solving problem (4), we convert the directed graph into an undirected one via $W = (P + P^T)/2$, and the connectivity distribution serves as the reconstruction goal of the graph auto-encoder, which will be elaborated soon.

Encoder

As shown in (Kipf & Welling, 2017), graphs with self-loops show better performance, i.e., $\hat{A} = A + I$. In our construction, since $d_{ii} = 0$, every node receives a positive weight to itself, so the graph already contains self-loops whose weights are learned adaptively rather than fixed to the primitive value 1. Consequently, we can simply set $\hat{A} = W$ and $\hat{D}_{ii} = \sum_j w_{ij}$. The encoder consists of multiple GCN layers and aims to transform the raw features into latent features with the constructed graph structure. Specifically speaking, the latent feature is defined as

$$Z = H^{(L)}, \quad H^{(l+1)} = \sigma\big(\hat{D}^{-1/2} W \hat{D}^{-1/2} H^{(l)} \Theta^{(l)}\big), \quad H^{(0)} = X, \qquad (7)$$

where $L$ is the number of encoder layers.

Decoder

Instead of reconstructing the weight matrix $W$, we aim to recover the connectivity distribution $P$. Firstly, the distances between latent features are calculated as $\hat{d}_{ij} = \|z_i - z_j\|_2^2$. Secondly, the connectivity distribution is reconstructed by a normalization step

$$\hat{p}_{ij} = \frac{\exp(-\hat{d}_{ij})}{\sum_{l} \exp(-\hat{d}_{il})}. \qquad (8)$$

The above process can be regarded as feeding $-\hat{d}_{ij}$ into a softmax layer. Clearly, the smaller $\hat{d}_{ij}$ is, the larger $\hat{p}_{ij}$ is. In other words, the similarity is measured by Euclidean distances rather than the inner products used in GAE. To measure the difference between two distributions, the Kullback-Leibler (KL) divergence is conventionally utilized. Consequently, the objective function is defined as

$$\min \sum_{i=1}^{n} D_{\mathrm{KL}}(p_i \,\|\, \hat{p}_i) = \min \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{\hat{p}_{ij}} \;\Leftrightarrow\; \min \, -\sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} \log \hat{p}_{ij}. \qquad (9)$$

Note that the equivalent form on the right amounts to minimizing the cross entropy, which is widely employed in classification tasks.
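A compact PyTorch sketch of this decoder and loss follows; it is our own illustration (the function name and the small epsilon are ours), using dense pairwise distances.

```python
import torch
import torch.nn.functional as F

def decode_and_loss(Z, P_target, eps=1e-10):
    """Decode embeddings into a connectivity distribution via a softmax over
    negative squared distances (Eq. (8)) and compute the cross-entropy /
    KL reconstruction loss (Eq. (9)), averaged over nodes."""
    dist = torch.cdist(Z, Z, p=2) ** 2        # (n, n) squared Euclidean distances
    P_hat = F.softmax(-dist, dim=1)           # row-wise reconstructed distribution
    loss = -(P_target * torch.log(P_hat + eps)).sum(dim=1).mean()
    return P_hat, loss
```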

Local Structure Preserving

A primary drawback of auto-encoders is that, due to the powerful representation capacity of neural networks, there may exist diverse latent representations that can all be decoded back to the input of the encoder. Nevertheless, some of these representations may be useless or even harmful. To break this restriction, a popular method is to introduce some prior information, such as in the adversarial auto-encoder (Makhzani et al., 2015) and the variational auto-encoder (Kingma & Welling, 2014). Since the similarities are measured by distances and local information is often credible (especially in manifold learning), we add a local structure preserving penalty to Eq. (9), and thus the cost function is defined as

$$\mathcal{L} = -\sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} \log \hat{p}_{ij} + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \|z_i - z_j\|_2^2, \qquad (10)$$

where $z_i$ is the embedding of node $v_i$ and $\lambda$ is a tradeoff parameter that balances the cross entropy term and the local consistency penalty term.
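A minimal sketch of this penalty, assuming the usual graph-Laplacian identity $\sum_{ij} w_{ij}\|z_i - z_j\|_2^2 = 2\,\mathrm{tr}(Z^T L Z)$; the helper name is ours.

```python
import torch

def local_structure_penalty(Z, W):
    """Return tr(Z^T L Z), i.e. half of sum_ij w_ij * ||z_i - z_j||^2,
    where L = D - W is the unnormalized graph Laplacian of W."""
    L = torch.diag(W.sum(dim=1)) - W
    return torch.trace(Z.t() @ L @ Z)

# Total loss of Eq. (10): reconstruction + lambda * local consistency, e.g.
#   loss = recon_loss + lam * 2 * local_structure_penalty(Z, W)
```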

Input: initial sparsity $k_0$, upper bound of the sparsity $k_{\max}$ and number of iterations $T$ to update the weighted adjacency.
  $Z \leftarrow X$, $k \leftarrow k_0$
  for $t = 1, 2, \ldots, T$ do
     Compute $\gamma_i$ via Eq. (6).
     Compute the distances $d_{ij} = \|z_i - z_j\|_2^2$.
     Compute $P$ and $W$ by solving problem (4) with sparsity $k$.
     repeat
        Update the GAE with Eq. (10) by gradient descent.
     until convergence or exceeding the maximum number of iterations
     Get the new embedding $Z$.
     $k \leftarrow k + (k_{\max} - k_0)/T$
  end for
  Perform spectral clustering on $W$.
Output: clustering assignments, weighted adjacency $W$ and embedding $Z$.
Algorithm 1 Algorithm to optimize AdaGAE

3.3 Adaptive Graph Auto-Encoder

In the last subsection, the weighted adjacency matrix $W$ is regarded as fixed during the training phase. However, the weighted adjacency matrix is computed through problem (4). The whole clustering process should also contain connectivity learning and hence, the weighted adjacency should be updated adaptively during training. An intuitive approach is to recompute the connectivity distribution based on the embedding $Z$, which contains the potential manifold structure information of the data. However, the following theorem shows that this simple update based on the latent representations may lead to collapse.

Theorem 2.

Denote $\hat{d}_{ij} = \|z_i - z_j\|_2^2$, where $Z$ is generated by the GAE with sparsity $k$. If $\hat{P}$ approximates $P$ well (numerically), then the solution of

$$\min_{q_i^T \mathbf{1} = 1,\ q_i \ge 0} \sum_{j} q_{ij} \hat{d}_{ij} + \gamma_i \|q_i\|_2^2, \qquad (11)$$

with the same sparsity $k$, degenerates into an unweighted adjacency matrix.

Intuitively, the unweighted graph is indeed a bad choice for classical clustering tasks. Therefore, the update step with the same sparsity coefficient may result in collapse. To address this problem, we assume that

Assumption 1.

Suppose that the sparse and weighted adjacency is good enough; specifically, the weight of an edge is large if it lies within a cluster and small otherwise. Then, under the latent representation, samples belonging to the same cluster become more cohesive as measured by the Euclidean distance.

According to the above assumption, samples from the same cluster are more likely to lie in a local area after the GAE mapping. Hence, the sparsity $k$ should increase when the weighted adjacency is updated. The step size by which $k$ increases needs to be discussed. In an ideal situation, we can define the upper bound of $k$ as

$$k_{\max} = \min_{1 \le c' \le c} |\mathcal{C}_{c'}|, \qquad (12)$$

where $\mathcal{C}_{c'}$ denotes the $c'$-th cluster. Although $k_{\max}$ is not known in advance, we can define it empirically to ensure $k \le k_{\max}$, for instance according to the number of samples and the number of clusters. Accordingly, the step size is set as $(k_{\max} - k_0)/T$, where $T$ is the number of iterations used to update the weighted adjacency.
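As a small sketch (with our own variable names k0, k_max and T), the schedule simply grows the sparsity linearly from its initial value to the upper bound:

```python
def sparsity_schedule(k0, k_max, T):
    """Return the sparsity used at each of the T graph updates: k grows
    linearly from k0 to k_max with step size (k_max - k0) / T."""
    step = (k_max - k0) / T
    return [int(round(k0 + t * step)) for t in range(T + 1)]

# Example: sparsity_schedule(5, 50, 10) grows k from 5 up to 50 in ten steps.
```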

To sum up, Algorithm 1 summarizes the whole process of the Adaptive Graph Auto-Encoder (AdaGAE).

4 Optimization and Theoretical Analysis

In this section, we first show how to solve problem (4) analytically. Then the proofs of the two aforementioned theorems are elaborated.

4.1 Optimization of Problem (4)

To keep the notation uncluttered, $d_{ij}$ is simplified as $d_j$ and $p_{ij}$ as $p_j$. Then problem (4) is equivalent to solving the following subproblem for each node individually

$$\min_{p^T \mathbf{1} = 1,\ p \ge 0} \sum_{j} p_j d_j + \gamma \|p\|_2^2. \qquad (13)$$

To keep the discussion concise, the node subscript $i$ is neglected. Since the term independent of $p$ is constant, we have the equivalent problem

$$\min_{p^T \mathbf{1} = 1,\ p \ge 0} \left\| p + \frac{d}{2\gamma} \right\|_2^2. \qquad (14)$$

Then the Lagrangian of the above problem is

$$\mathcal{L}(p, \eta, \beta) = \left\| p + \frac{d}{2\gamma} \right\|_2^2 - \eta (p^T \mathbf{1} - 1) - \beta^T p, \qquad (15)$$

where $\eta$ and $\beta \ge 0$ are Lagrangian multipliers. According to the KKT conditions,

$$2\Big(p + \frac{d}{2\gamma}\Big) - \eta \mathbf{1} - \beta = 0, \quad \beta_j p_j = 0, \quad p \ge 0, \quad \beta \ge 0, \quad p^T \mathbf{1} = 1. \qquad (16)$$

It is not hard to verify that

$$p_j = \Big(\frac{\eta}{2} - \frac{d_j}{2\gamma}\Big)_+, \qquad (17)$$

where $(x)_+ = \max(x, 0)$. Without loss of generality, suppose that $d_1 \le d_2 \le \cdots \le d_n$. According to Theorem 1, $p$ is $k$-sparse, or equivalently, $p_k > 0$ and $p_{k+1} = 0$. Due to $p^T \mathbf{1} = 1$, we have

$$\frac{\eta}{2} = \frac{1}{k} + \frac{1}{2 k \gamma} \sum_{l=1}^{k} d_l. \qquad (18)$$

Substituting Eq. (6) into Eq. (18), we have

$$p_j = \frac{(d_{k+1} - d_j)_+}{k d_{k+1} - \sum_{l=1}^{k} d_l}. \qquad (19)$$

It is not hard to verify that Eq. (19) remains the optimum in the boundary case as well. Accordingly, the connectivity distribution can be calculated via a closed-form solution.

4.2 Proof of Theorem 1

To keep the notation uncluttered, we make the same assumption as in Section 4.1, i.e., the subscript is dropped and $d_1 \le d_2 \le \cdots \le d_n$.

Proof.

To begin with, the auxiliary function

(20)

is non-decreasing with . Note that is strictly increasing when . If , there must exist that satisfies . In this case, is -sparse. According to Eq. (18), we have

(21)

When satisfies Eq. (5), we have

(22)

according to the non-decreasing property of the auxiliary function . If , then

(23)

Hence, which leads to contradiction. If

(24)

In the first inequality, the equality will never hold due to . Accordingly, is at least -sparse, which leads to a contradiction as well. Therefore, we have .

When , it is not hard to verify that which also leads to a contradiction. Finally, will never hold due to the constraint .

In sum, the theorem is proved. ∎

4.3 Analysis of Degeneration

In this part, we first prove Theorem 2 and then explain the phenomenon from a different perspective. We now give the proof of Theorem 2.

Proof.

Consider the connectivity distribution of node $v_i$ and suppose that $p_{i,(k+1)} = \cdots = p_{i,(n)} = 0$, i.e., only the $k$ nearest neighbors have nonzero weights. When $\hat{p}_{ij}$ approximates 0 numerically for such a zero entry, the decoder of Eq. (8) implies that

$$\hat{d}_{ij} = -\log \hat{p}_{ij} - \log \sum_{l} \exp(-\hat{d}_{il}) \rightarrow +\infty. \qquad (25)$$

If we update the graph by Eq. (19) based on $\hat{d}$, then for any $j$ with $p_{ij} > 0$,

$$q_{ij} = \frac{\hat{d}_{i,(k+1)} - \hat{d}_{ij}}{k \hat{d}_{i,(k+1)} - \sum_{l=1}^{k} \hat{d}_{i,(l)}} \rightarrow \frac{1}{k}, \qquad (26)$$

due to $\hat{d}_{i,(k+1)} \rightarrow +\infty$. The proof is easy to extend to other nodes. Hence, the theorem is proved. ∎
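A tiny numerical illustration of this degeneration (our own toy example, using the closed form of Eq. (19) as written above): when the (k+1)-th smallest distance dwarfs the first k distances, the recomputed weights become nearly uniform.

```python
import numpy as np

# Three small distances (a tight cluster) and two huge ones, with k = 3.
d = np.sort(np.array([0.10, 0.11, 0.12, 50.0, 60.0]))
k = 3
d_k1 = d[k]                                   # (k+1)-th smallest distance, here 50.0
q = np.maximum(d_k1 - d[:k], 0) / (k * d_k1 - d[:k].sum())
print(q)                                      # ~[0.3334, 0.3333, 0.3333]: nearly 1/k each
```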

On the other hand, the following theorem demonstrates that the softmax output layer applied to the negative latent distances is equivalent to solving problem (3) with a totally different regularization. Therefore, the perfect approximation may lead to bad performance.

Theorem 3.

The decoder of AdaGAE is equivalent to solving the following problem

$$\min_{p_i^T \mathbf{1} = 1,\ p_i \ge 0} \sum_{j} p_{ij} \hat{d}_{ij} - H(p_i), \qquad (27)$$

where $H(p_i) = -\sum_j p_{ij} \log p_{ij}$ represents the entropy of the connectivity distribution of node $v_i$.

The proof is stated in supplementary material.

4.4 Spectral Analysis

As mentioned in subsection 3.2, AdaGAE generates a weighted graph with adaptive self-loops. Analogous to SGC (Wu et al., 2019), the adaptive self-loops also reduce the spectrum of the normalized Laplacian. Formally,

Theorem 4.

Let and

. According to eigenvalue decomposition, suppose

and . The following inequality always holds

(28)

The proof is stated in the supplementary material. This theorem indicates that the adaptive self-loops smooth the Laplacian matrix as well.

Methods UMIST JAFFE ORL PALM COIL20 YALE USPS
SC 29.01 70.14 56.43 19.54 56.42 29.58 28.75
K-Means 42.87 72.39 54.75 70.39 58.26 41.33 64.67
FKSC 50.64 79.58 69.33 75.30 69.78 50.96
CAN 69.62 96.71 68.00 88.10 84.10 49.09 67.96
DFKM 45.47 90.83 61.43 67.45 60.21 51.43 73.42
DEC 36.47 62.95 27.45 74.35 50.42 42.30 71.22
GAE (fixed W) 73.22 96.71 71.75 88.30 92.43 57.58 67.48
AdaGAE (fixed sparsity) 32.00 47.42 68.00 91.80 33.82 57.58 34.09
AdaGAE 83.48 97.27 71.40 95.25 93.75 57.58 91.96
Table 1: ACC (%)
Methods UMIST JAFFE ORL PALM COIL20 YALE USPS
SC 30.77 77.94 75.69 37.25 71.06 34.78 21.54
K-Means 65.47 80.90 75.43 89.98 74.58 48.71 62.88
FKSC 67.67 84.24 83.32 93.28 80.75 50.38
CAN 87.75 96.39 83.58 97.08 90.93 56.18 78.85
DFKM 67.04 92.01 81.13 86.74 76.81 55.92 71.58
DEC 56.96 82.83 55.22 90.37 69.43 48.71 72.25
GAE (fixed W) 87.04 96.39 84.51 96.96 97.26 76.93 76.45
AdaGAE (fixed sparsity) 52.08 59.55 83.65 97.80 55.46 76.93 35.39
AdaGAE 91.03 96.78 85.34 98.18 98.36 76.93 84.81
Table 2: NMI (%)
Figure 2: t-SNE visualization on UMIST and USPS. Panels (a)-(d) show results on UMIST and panels (e)-(h) show results on USPS; in each row, the panels correspond to raw features, AdaGAE with fixed sparsity, GAE with a fixed graph, and AdaGAE, respectively. Clearly, AdaGAE projects similar samples into the same embedding. Notice that a few data points are projected into the wrong group; these points are usually regarded as outliers.

5 Experiments

In this section, the experimental details of AdaGAE are described and the results are reported. The visualization supports the theoretical analysis presented in the last section.

5.1 Datasets

AdaGAE is evaluated on 7 datasets in total: UMIST (Hou et al., 2013), JAFFE (Lyons et al., 1999), ORL (Cai et al., 2010), PALM, COIL20 (Nene et al., 1996), YALE (Georghiades et al., 2001) and USPS (Hull, 1994). The concrete information is summarized in Table 3. All features are rescaled to a common range. Since applying GCNs to large graphs is still a challenging problem and is not the focus of this paper, we only conduct experiments on small and middle scale datasets. A feasible extension is to replace the encoder with a GCN designed for large scale datasets, such as StoGCN (Chen et al., 2018), Cluster-GCN (Chiang et al., 2019), etc.

Name # Features # Size # Classes
UMIST 1024 575 20
JAFFE 1024 213 10
ORL 1024 400 40
PALM 256 2000 100
COIL20 1024 1440 20
YALE 1024 165 15
USPS 256 9298 10
Table 3: Information of Datasets

5.2 Compared Methods

To evaluate AdaGAE, 6 methods are compared in total, including Spectral Clustering (SC) (Ng et al., 2002), K-Means, FKSC (Zhang et al., 2018), CAN (Nie et al., 2014), DFKM (Zhang et al., 2019) and DEC (Xie et al., 2016). Roughly speaking, SC, K-Means, FKSC and CAN are traditional clustering approaches, while DFKM and DEC are deep clustering models with different embedded clustering modules. The hyper-parameters of these methods are searched in the same way as described in the corresponding papers. The codes of these methods are downloaded from the authors' homepages.

5.3 Experimental Setup

In our experiments, the encoder consists of two GCN layers. If the input dimension is 1024, the first layer has 256 neurons and the second layer has 64 neurons. Otherwise, the two layers have 128 and 64 neurons, respectively. The activation function of the first layer is ReLU, while the second layer employs a linear activation. The initial sparsity $k_0$ is set as 5, and the upper bound $k_{\max}$ and the tradeoff coefficient $\lambda$ are tuned by grid search. The number of graph update steps $T$ is set as 10 and the maximum number of iterations to optimize the GAE varies across datasets; the exact values are given in the supplementary material.
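For reference, the two-layer encoder described above can be sketched in PyTorch as follows. The class and argument names are ours, and the normalized weighted adjacency is assumed to be precomputed from the constructed graph.

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    """Two GCN layers: ReLU after the first, linear (identity) after the second."""
    def __init__(self, in_dim, hidden_dim=256, out_dim=64):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W2 = nn.Linear(hidden_dim, out_dim, bias=False)

    def forward(self, X, A_norm):
        # A_norm: normalized weighted adjacency of the constructed graph
        H = torch.relu(A_norm @ self.W1(X))
        return A_norm @ self.W2(H)            # 64-dimensional embedding Z
```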

To verify the effect of the adaptive process, two extra experiments are conducted: GAE with a fixed graph $W$ and AdaGAE with fixed sparsity $k$. Note that all hyper-parameters are the same except for the specific setting under study.

Two popular clustering metrics, accuracy (ACC) and normalized mutual information (NMI), are employed to evaluate performance. All methods are run 10 times and the means are reported. The code is implemented with PyTorch 1.3.1 on a Windows 10 PC with an NVIDIA GeForce GTX 1660 GPU and an 8-core Intel i7 CPU. The exact values of the hyper-parameters can be found in the supplementary material.

5.4 Experimental Results

ACC and NMI of all the mentioned methods are summarized in Table 1 and Table 2, respectively. The best results of both the competitors and the AdaGAE variants are highlighted in boldface. From Table 1 and Table 2, we conclude that:

  • Classical deep clustering models suffer from overfitting and work poorly on small scale datasets, while AdaGAE is stable on all datasets. Moreover, the improvement brought by graph convolution is impressive. Specifically, AdaGAE outperforms DFKM by about 20% on ACC and 7% on NMI for USPS.

  • When the sparsity is kept fixed, AdaGAE collapses on UMIST, JAFFE, COIL20 and USPS. For example, ACC shrinks by about 50% and NMI by more than 40% on COIL20.

  • From the comparison of the two extra experiments, we confirm that the adaptive graph update process plays a positive role on most datasets except for ORL, which may be caused by too few samples in each cluster, such that it is hard to define $k_{\max}$ and the step size.

Besides, Figure 2 illustrates the learned embedding vividly. Combined with Theorem 2, if the sparsity $k$ is fixed as a constant, then $W$ degenerates into an unweighted adjacency matrix and a cluster is broken into a mass of groups. Each group only contains a small number of data points and they scatter chaotically, which leads to collapse. Instead, the adaptive process introduced in Section 3.3 connects these groups before degeneration by increasing the sparsity, and hence the embeddings within a cluster become quite cohesive. It should be emphasized that a large $k$ frequently leads to capturing wrong information. After the transformation of the GAE, the nearest neighbors are more likely to belong to the same cluster, and thus it is rational to increase $k$ with an adequate step size.

6 Conclusion

In this paper, we propose a novel model for general data clustering, namely the Adaptive Graph Auto-Encoder (AdaGAE). A generative graph representation model is utilized to construct a weighted graph with steerable sparsity. To exploit the potential information of the data, we employ the graph convolution operator, and a graph auto-encoder with local structure preserving is designed accordingly. More importantly, as the graph used in the GAE is constructed artificially, an adaptive update step is developed to update the graph with the help of the learned embedding. The related theoretical analysis explains why AdaGAE with fixed sparsity collapses during the update step. In the experiments, we verify the effectiveness of the adaptive update step by removing the corresponding part of AdaGAE. The visualization supports the theoretical analysis well and confirms the necessity of the adaptive graph update.

References

  • Abu-El-Haija et al. (2019) Abu-El-Haija, S., Perozzi, B., Kapoor, A., Alipourfard, N., Lerman, K., Harutyunyan, H., Ver Steeg, G., and Galstyan, A. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pp. 21–29, 2019.
  • Bruna et al. (2013) Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
  • Cai et al. (2010) Cai, D., Zhang, C., and He, X. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 333–342, 2010.
  • Cao et al. (2016) Cao, S., Lu, W., and Xu, Q. Deep neural networks for learning graph representations. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Chen et al. (2018) Chen, J., Zhu, J., and Song, L. Stochastic training of graph convolutional networks with variance reduction. In International Conference on Machine Learning, pp. 942–950, 2018.
  • Chiang et al. (2019) Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C.-J. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257–266, 2019.
  • Georghiades et al. (2001) Georghiades, A., Belhumeur, P., and Kriegman, D. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):643–660, 2001.
  • Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864, 2016.
  • Hinton & Salakhutdinov (2006) Hinton, G. E. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • Hou et al. (2013) Hou, C., Nie, F., Li, X., Yi, D., and Wu, Y. Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Transactions on Cybernetics, 44(6):793–804, 2013.
  • Hull (1994) Hull, J. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In ICLR, 2014.
  • Kipf & Welling (2016) Kipf, T. N. and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
  • Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • Lyons et al. (1999) Lyons, M., Budynek, J., and Akamatsu, S. Automatic classification of single facial images. IEEE transactions on pattern analysis and machine intelligence, 21(12):1357–1362, 1999.
  • Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
  • Nene et al. (1996) Nene, S., Nayar, S., and Murase, H. Columbia object image library (coil-20). 1996.
  • Ng et al. (2002) Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856, 2002.
  • Nie et al. (2014) Nie, F., Wang, X., and Huang, H. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 977–986. ACM, 2014.
  • Niepert et al. (2016) Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023, 2016.
  • Pan et al. (2018) Pan, S., Hu, R., Long, G., Jiang, J., Yao, L., and Zhang, C. Adversarially regularized graph autoencoder for graph embedding. In IJCAI, pp. 2609–2615, 2018.
  • Park et al. (2019) Park, J., Lee, M., Chang, H. J., Lee, K., and Choi, J. Y. Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6519–6528, 2019.
  • Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.
  • Scarselli et al. (2008) Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
  • Shaham et al. (2018) Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., and Kluger, Y. Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.
  • Wang et al. (2017) Wang, C., Pan, S., Long, G., Zhu, X., and Jiang, J. Mgae: Marginalized graph autoencoder for graph clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 889–898, 2017.
  • Wang et al. (2019) Wang, C., Pan, S., Hu, R., Long, G., Jiang, J., and Zhang, C. Attributed graph clustering: a deep attentional embedding approach. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3670–3676. AAAI Press, 2019.
  • Wang et al. (2016) Wang, D., Cui, P., and Zhu, W. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234, 2016.
  • Wang et al. (2018) Wang, H., Wang, J., Wang, J., Zhao, M., Zhang, W., Zhang, F., Xie, X., and Guo, M. Graphgan: Graph representation learning with generative adversarial nets. In Thirty-Second AAAI Conference on Artificial Intelligence, pp. 2508–2515, 2018.
  • Wu et al. (2019) Wu, F., Zhang, T., Holanda de Souza, A., Fifty, C., Yu, T., and Weinberger, K. Q. Simplifying graph convolutional networks. Proceedings of Machine Learning Research, 2019.
  • Xie et al. (2016) Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487, 2016.
  • Xu et al. (2019) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In Proc. of ICLR, 2019.
  • Zhang et al. (2018) Zhang, R., Nie, F., Guo, M., Wei, X., and Li, X. Joint learning of fuzzy k-means and nonnegative spectral clustering with side information. IEEE Transactions on Image Processing, 28(5):2152–2162, 2018.
  • Zhang et al. (2019) Zhang, R., Li, X., Zhang, H., and Nie, F. Deep fuzzy k-means with adaptive loss and entropy regularization. IEEE Transactions on Fuzzy Systems, pp. 1–1, 2019. ISSN 1941-0034. doi: 10.1109/TFUZZ.2019.2945232.

Appendix A Proof of Theorem 3

Proof.

Problem (27) is equivalent to the following subproblem for the $i$-th node

$$\min_{p^T \mathbf{1} = 1,\ p \ge 0} \sum_{j} p_j \hat{d}_j + \sum_{j} p_j \log p_j. \qquad (29)$$

Similarly, the subscript $i$ is omitted to keep the notation uncluttered. The Lagrangian is

$$\mathcal{L}(p, \eta, \beta) = \sum_{j} p_j \hat{d}_j + \sum_{j} p_j \log p_j - \eta (p^T \mathbf{1} - 1) - \beta^T p. \qquad (30)$$

Then the KKT conditions are

$$\hat{d}_j + \log p_j + 1 - \eta - \beta_j = 0, \quad \beta_j p_j = 0, \quad p \ge 0, \quad \beta \ge 0, \quad p^T \mathbf{1} = 1. \qquad (31)$$

Due to the logarithm, $p_j > 0$ and hence $\beta_j = 0$. Using the first condition, we have

$$p_j = \exp(\eta - 1 - \hat{d}_j). \qquad (32)$$

Combining it with the constraint $p^T \mathbf{1} = 1$, we have

$$\exp(\eta - 1) = \frac{1}{\sum_{l} \exp(-\hat{d}_l)}. \qquad (33)$$

Furthermore, we have

$$p_j = \frac{\exp(-\hat{d}_j)}{\sum_{l} \exp(-\hat{d}_l)}. \qquad (34)$$

With $\hat{d}_{ij} = \|z_i - z_j\|_2^2$, this is exactly the decoder of Eq. (8), and the theorem is proved. ∎

Appendix B Proof of Theorem 4

It should be pointed out that the proof imitates the corresponding proof in (Wu et al., 2019). Analogous to Lemma 3 in (Wu et al., 2019), we first give the following lemma without proof,

Lemma 1.

Let be eigenvalues of and be eigenvalues of . The following inequality always holds

(35)

The lemma follows directly from Lemma 3 provided in (Wu et al., 2019). The proof of Theorem 4 is given as follows.

Proof.

Let and we have

Dataset
UMIST 1 5 50 10
COIL20 1 5 100 10
JAFFE 5 20 10
PALM 10 50 10
YALE 5 150 10
ORL 5 150 10
USPS 5 150 10
Table 4: Hyper-parameter settings for each dataset: the local-information tradeoff coefficient, the initial sparsity, the learning rate, the regularization coefficient, and the number of iterations to update the GAE.

Appendix C Experimental Details

The exact hyper-parameter values of AdaGAE used in our experiments are reported in Table 4. For SC, we construct the graph via the Gaussian kernel, which is given as

$$w_{ij} = \begin{cases} \exp\Big(-\dfrac{\|x_i - x_j\|_2^2}{2\sigma^2}\Big), & x_j \in \mathcal{N}_k(x_i), \\ 0, & \text{otherwise}, \end{cases} \qquad (36)$$

where $\mathcal{N}_k(x_i)$ represents the $k$-nearest neighbors of sample $x_i$. Both $k$ and $\sigma$ are tuned by grid search.
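A sketch of this construction (our own code, assuming the kernel form of Eq. (36) as written above, followed by a symmetrization of the kNN graph):

```python
import numpy as np

def gaussian_knn_graph(X, k, sigma):
    """Gaussian-kernel weights restricted to the k nearest neighbors of each
    sample (cf. Eq. (36)), then symmetrized so the graph is undirected."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # squared distances
    W = np.zeros_like(D)
    for i in range(D.shape[0]):
        idx = np.argsort(D[i])[1:k + 1]                    # skip the sample itself
        W[i, idx] = np.exp(-D[i, idx] / (2 * sigma ** 2))
    return (W + W.T) / 2
```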

The maximum number of iterations of GAE with a fixed graph $W$ is set as 200. Note that its objective function is also defined as Eq. (10).

The codes of K-means, SC, FKSC, and CAN are implemented under MATLAB 2019a, while the codes of DFKM, DEC and AdaGAE are implemented under Python 3.6.

For all datasets, we simply rescale the features to a common range. All datasets are downloaded from http://www.escience.cn/people/fpnie/index.html.