Clustering, which plays important roles in many applications, is one of the most fundamental topics in machine learning and pattern recognition. In past decades, a spectrum of algorithms has been proposed (Dunn, 1973; Ester et al., 1996; Nie et al., 2014). In particular, k-means (Dunn, 1973), which exploits density information to group samples, and spectral clustering (Shi & Malik, 2000; Ng et al., 2002), which aims to solve a graph cut problem, are the two most popular methods. K-means assigns each sample to a cluster and then updates the centroids of all clusters according to the current assignments. Spectral clustering constructs a similarity matrix (which can be regarded as a weighted graph) and then performs eigenvalue decomposition on its Laplacian matrix to obtain a relaxed indicator matrix. Finally, the rows of the indicator matrix are grouped by some method, such as k-means or spectral rotation. The drawbacks of these two algorithms are obvious: k-means is too simple to work on non-spherical data, while spectral clustering depends heavily on the construction of the similarity matrix.
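The spectral clustering pipeline described above (Laplacian, eigenvalue decomposition, k-means on the rows of the relaxed indicator) can be sketched as follows. This is a minimal illustration rather than any cited implementation: it assumes the unnormalized Laplacian of a Ratio Cut formulation, and the helper names `kmeans` and `spectral_clustering`, the farthest-point initialization, and the toy graph are all choices made here for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """Plain Lloyd iterations with a deterministic farthest-point start (illustrative)."""
    centers = [X[0]]
    for _ in range(1, k):
        # squared distance of every row to its closest already-chosen center
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # assignment step: nearest centroid for each sample
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned samples
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def spectral_clustering(W, k):
    """Ratio-Cut style spectral clustering on a symmetric similarity matrix W."""
    L = np.diag(W.sum(axis=1)) - W      # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)         # eigenvalues come back in ascending order
    F = vecs[:, :k]                     # relaxed indicator: k smallest eigenvectors
    return kmeans(F, k)                 # group the rows of the indicator

# Toy graph (assumed for illustration): two triangles joined by one weak edge.
W = np.zeros((6, 6))
for block in ([0, 1, 2], [3, 4, 5]):
    for i in block:
        for j in block:
            if i != j:
                W[i, j] = 1.0
W[2, 3] = W[3, 2] = 0.1
labels = spectral_clustering(W, k=2)
```

On this toy graph the two triangles end up in different clusters, which shows how the result depends entirely on the similarity matrix `W` supplied to the method.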
Auto-encoder (Hinton & Salakhutdinov, 2006) is a well-known deep model that extracts deep features with an encoder and reconstructs raw features with a decoder. Stacked Auto-Encoder (SAE) (Tian et al., 2014a) trains the auto-encoder layer by layer. An obvious drawback is that the clustering phase and the auto-encoder training phase are independent. In other words, the extracted features may not be beneficial for clustering. DEC (Xie et al., 2016) embeds k-means into auto-encoders and optimizes both the neural network and the k-means model by SGD. However, optimization through SGD is inefficient since it needs many iterations to converge. Moreover, the relationship among data points may be neglected, as the auto-encoder cannot process links between samples.
In recent years, the graph convolution network (Scarselli et al., 2008; Wu et al., 2019) has attracted much attention as it extends the convolution operation and neural networks to graph-type data. In network embedding, the goal is to learn representations for each node (or graph) (Wang et al., 2018). The learned embedding is used for graph classification, node clustering and link prediction. Similarly, graph auto-encoder (GAE) (Kipf & Welling, 2016) is an extension of the auto-encoder, which learns node representations with the adjacency and reconstructs the adjacency from the extracted features. The embedding is frequently used for node clustering and link prediction. However, it often suffers from overfitting, and the generated representations may be unsuitable for clustering.
In this paper, we propose a graph auto-encoder based clustering model, Embedding Graph Auto-Encoder with JOint Clustering via Adjacency Sharing (EGAE-JOCAS). The main contributions include: 1) EGAE-JOCAS incorporates a clustering model into the graph auto-encoder. Therefore, GAE and the clustering model are optimized simultaneously, which drives GAE to generate more appropriate representations for clustering. The optimization is composed of SGD and closed-form solutions. 2) We develop a novel clustering model, named joint clustering, to employ the learned graph embedding more rationally. Joint clustering is a fusion of relaxed k-means and spectral clustering, and the adjacency is shared by both GAE and joint clustering. Meanwhile, we also design a decoder of GAE to generate adequate representations for joint clustering. 3) EGAE-JOCAS can incorporate more sophisticated mechanisms such as attention and pooling.
In this paper, all matrices are represented by uppercase letters and all vectors are denoted by bold lowercase letters. For a matrix, is the -th row vector, is the -th column vector, is the trace of , is the transpose of and means all elements are non-negative. denotes the identity matrix, denotes the vector whose elements are all 1. is the gradient operator. and denote row and column vectors respectively.
2 Related Work
In this section, we first review works about graph convolution neural network for graph embedding. Then we introduce several representative models which aim to employ deep learning to perform clustering.
2.1 Graph Convolution Neural Network
Graphs, which can describe complicated relationships, have been widely studied and applied in diverse applications, such as community networks, bioscience and recommender systems (Bojchevski et al., 2018; Wang et al., 2017; Liu et al., 2015). GraphGAN (Wang et al., 2018) incorporates adversarial learning into graph embedding learning and develops a graph softmax function to utilize the structure of the graph. Wang et al. (2017) combine graph embedding and clustering to perform clustering on graphs. However, the capacities of these methods are limited and they only utilize shallow information of the graph. As a result, they cannot make effective use of structure information or exploit potential relationships among nodes.
Due to the success of CNN, graph convolution neural networks have been studied widely in recent years. A central problem is how to extend the convolution operation to irregular data. According to (Wu et al., 2019), all existing convolution operations can be split into two categories: spectral-based methods (Bruna et al., 2013; Kushnir et al., 2006; Kipf & Welling, 2017) and spatial-based methods (Niepert et al., 2016; Micheli, 2009; Veličković et al., 2017). On the one hand, spectral-based methods are motivated by the convolution theorem, the Fourier transformation and the characteristics of the Laplacian operator (Shuman et al., 2013). In spectral-based models, the spectral domain is regarded as the frequency domain, which is a fundamental concept in the Fourier transformation. A natural idea is to use an ordinary matrix as the convolutional kernel to learn. A distinct drawback is that the kernel then has too many parameters to learn. To address this problem, (Bruna et al., 2013) assumes that the convolutional kernel in the spectral domain is diagonal. To further reduce the number of parameters to learn and improve the efficiency of convolution, ChebNet (Kushnir et al., 2006) utilizes Chebyshev polynomials to approximate the convolutional kernel. On the other hand, spatial-based methods do not transform the domain and focus on how to select the nodes on which to perform convolution. For example, PATCH-SAN (Niepert et al., 2016) ranks the other nodes with respect to a given node and selects the top-ranked neighbors to perform convolution. In particular, GCN, proposed in (Kipf & Welling, 2017), combines spectral-based and spatial-based models. It employs a linear approximation of convolution kernels via Chebyshev polynomials. Moreover, it can be regarded as a spatial method which only considers 1-neighbors to perform convolution.
2.2 Deep clustering
Auto-encoder (Hinton & Salakhutdinov, 2006), as a classical variant of neural network for unsupervised learning, is often employed to perform clustering, as in SAE (Tian et al., 2014a) and StructAE (Peng et al., 2018). Roughly speaking, these models employ a multi-layer neural network (encoder) to learn non-linear features and reconstruct raw features from the learned features with a decoder. However, SAE and StructAE separate clustering from the training of the auto-encoder. DEC (Xie et al., 2016) embeds k-means into the auto-encoder but optimizes it by SGD. Gradient descent is an approximate method for objectives that cannot be solved directly, so SGD is not the best algorithm for solving embedded models, and DEC converges slowly. SpectralNet (Shaham et al., 2018), which is not based on the auto-encoder, aims to perform spectral clustering with neural networks.
However, all these methods fail to utilize the structure information provided by graph-type data. Graph auto-encoder (Kipf & Welling, 2016; Pan et al., 2018) extends the classical auto-encoder to graphs but does not consider clustering. In this paper, we propose a novel model that embeds a joint clustering model into the graph auto-encoder, so that it can exploit the structure information for clustering.
3 Embedding Graph Auto-Encoder with Joint Clustering via Adjacency Sharing
In this section, we will show how the graph auto-encoder (GAE) works and propose the Embedding Graph Auto-Encoder with JOint Clustering via Adjacency Sharing (EGAE-JOCAS). In section 5, we will demonstrate the rationality of utilization of GAE from both intuitive aspect and theoretical aspect.
3.1 Convolution on Graph
To apply the convolution operation to irregular data, we use a graph convolution that can be interpreted as both a spectral operator and a spatial operator. According to the convolution theorem, the convolution operator can be defined in the frequency domain, which is conventionally called the spectral domain of graph signals. Formally, a spatial graph signal $x \in \mathbb{R}^n$ can be transformed into the spectral domain by $\hat{x} = U^T x$, where $L = U \Lambda U^T$ (eigenvalue decomposition) and $L = I - D^{-1/2} A D^{-1/2}$ is the normalized Laplacian matrix. If the convolution kernel is constrained to be a function of $\Lambda$, a spectral convolution operator can be defined as
$$g_\theta \star x = U g_\theta(\Lambda) U^T x.$$
Frequently, $g_\theta(\Lambda)$ is assumed to be diagonal and is approximated by Chebyshev polynomials. If the linear approximation is used, the convolution can be defined as
$$g_\theta \star x \approx \theta_0 x + \theta_1 (L - I) x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x,$$
where $\theta_0$ and $\theta_1$ are the kernel parameters. To reduce the number of parameters to learn and simplify the graph convolutional network, suppose that $\theta = \theta_0 = -\theta_1$. Accordingly, the above equation becomes $g_\theta \star x \approx \theta (I + D^{-1/2} A D^{-1/2}) x$. Furthermore, we can renormalize the convolutional matrix as
$$\hat{L} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2},$$
where $\tilde{A} = A + I$ and $\tilde{D}$ is the degree matrix of $\tilde{A}$. Therefore, the processed signal can be rewritten as
$$g_\theta \star x \approx \theta \hat{L} x.$$
If the graph signal is multi-dimensional, i.e., $X \in \mathbb{R}^{n \times d}$, and $c$ convolution kernels are applied, then we have
$$H = \hat{L} X W,$$
where $W \in \mathbb{R}^{d \times c}$ contains the parameters to learn.
On the other hand, this convolution can also be regarded as a spatial convolution operator, since $\hat{L} x$ is equivalent to a weighted average of each node and its 1-neighbors. An intuitive explanation of the general graph convolution network used in our model is shown in Figure 1.
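A short NumPy sketch may make the renormalized convolution concrete. It assumes the renormalization of Kipf & Welling (2017), i.e. self-loops are added to the adjacency before symmetric degree normalization, and it stacks a linear layer and a ReLU layer as in the experimental settings reported later. The function names, the 3-node path graph and the random weights are illustrative assumptions, not part of the original model description.

```python
import numpy as np

def normalize_adjacency(A):
    """Renormalized convolution matrix D~^{-1/2} (A + I) D~^{-1/2} (assumed form)."""
    A_tilde = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))   # D~^{-1/2} stored as a vector
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(L_hat, X, W, activation=None):
    """One graph convolution: propagate over 1-neighbors, then mix features with W."""
    H = L_hat @ X @ W
    return H if activation is None else activation(H)

def relu(x):
    return np.maximum(x, 0.0)

# Path graph 0 - 1 - 2 (toy example).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L_hat = normalize_adjacency(A)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))      # 3 nodes, 4 raw features
W1 = rng.normal(size=(4, 5))     # hypothetical layer sizes
W2 = rng.normal(size=(5, 2))

# Two-layer encoder: linear first layer, ReLU second layer.
Z = gcn_layer(L_hat, gcn_layer(L_hat, X, W1), W2, activation=relu)
```

Each row of `L_hat` only has non-zero weights on the node itself and its direct neighbors, which is the spatial reading of the operator; the ReLU on the last layer keeps the embedding `Z` non-negative.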
3.2 Graph Auto-Encoder
To apply GCN to clustering, graph auto-encoder (GAE) combines the traditional auto-encoder and GCN. Like the traditional auto-encoder, GAE consists of an encoder and a decoder.
Encoder: The encoder attempts to learn a latent representation of raw data via multiple graph convolution layers, which have been introduced in the previous subsection. In this paper, for simplicity, only simple GCN layers are employed. However, the encoder can be built from any valid layers, such as attention or pooling layers.
Decoder: The decoder aims to reconstruct the adjacency from the embedding. In network embedding, the adjacency is usually given as prior knowledge and the graph is frequently unweighted. Therefore, the adjacency matrix $A$ satisfies $A_{ij} = 1$ if the $i$-th point is connected with the $j$-th one, and $A_{ij} = 0$ otherwise. Besides, similarity between two nodes is conventionally measured by the inner-product of their embeddings. Specifically, $z_i$ is analogous to $z_j$ if $z_i^T z_j$ is large enough, where $z_i$ denotes the embedding of the $i$-th node. Therefore, the decoder is designed as $\hat{A}_{ij} = \sigma(z_i^T z_j)$, where $\sigma(\cdot)$ denotes the sigmoid function.
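The inner-product decoder can be sketched in a few lines. The sketch assumes the standard GAE decoder of Kipf & Welling (2016), which applies the sigmoid to all pairwise inner-products of the embedding; the toy embedding `Z` and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inner_product_decoder(Z):
    """Reconstruct link probabilities from pairwise inner-products of the embedding."""
    return sigmoid(Z @ Z.T)

# Toy non-negative embedding: nodes 0 and 1 are similar, node 2 is orthogonal to node 0.
Z = np.array([[1.0, 0.0],
              [0.8, 0.1],
              [0.0, 2.0]])
A_rec = inner_product_decoder(Z)
```

Note that with a non-negative embedding every inner-product is at least 0, so no entry of `A_rec` can fall below 0.5: orthogonal pairs such as nodes 0 and 2 are mapped to exactly 0.5 rather than to a value near 0.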
We further assume that all inner-products are non-negative, i.e., . This assumption is crucial for the embedded clustering model, joint clustering, which will be described in the next subsection. The corresponding theoretical analysis is given in Section 5. Accordingly, the activation function of the last encoding layer should be non-negative (e.g., ReLU) so that . However, the sigmoid of a non-negative inner-product lies in $[0.5, 1)$, which cannot be regarded as a probability. To address this issue, it is necessary to find a mapping function which maps into so that outputs valid probabilities. A natural choice is . However, is too insensitive to large values, which means that will approach 1 slowly with increasing. An ideal should approximate when inputs are large enough. Motivated by this requirement, we design
which is demonstrated in Figure 3. Hence, the decoder used in our model is
3.3 Proposed Model
As k-means partitions data via density information and spectral clustering takes local consistency into account, a clustering model incorporating both advantages is developed and is named as joint clustering, which is defined as
where is the relaxed indicator, and is a tradeoff coefficient. The first term corresponds to k-means but with a continuous indicator , while the second term is the Ratio Cut used in spectral clustering.
Accordingly, the objective of GAE with Eq. (8) embedded is defined as
represents a certain loss function that measures the reconstruction error between and , is the tradeoff coefficient and is the extracted deep features. To avoid overfitting caused by GAE, a Lasso regularization is employed and the final objective function of EGAE-JOCAS is formulated as
where is parameters of the -th layer, is the amount of layers used in encoder and . As the adjacency is shared in both joint clustering and GAE, the spectral clustering term can be viewed as a regularization that utilizes prior structure information. Both of and are pretty easy to choose and can be set as constants for all datasets for simplicity.
Fix GAE and perform clustering: As the GAE is fixed, the problem (10) is equivalent to
Since there is no constraint on , we can take the derivative w.r.t. ,
which means that
Hence, the indicator matrix can be computed by performing eigenvalue decomposition of . In other words, the optimal consists of the leading eigenvectors of .
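The clustering step can be illustrated with a small eigen-solver. The sketch below rests on an assumption about the matrix being decomposed: it takes the Gram matrix of the embedding (the relaxed k-means term) minus a weighted unnormalized Laplacian (the Ratio Cut term), i.e. $ZZ^T - \lambda L$ with $L = D - A$. The symbols `Z`, `A`, `lam` and the function name are illustrative, and the exact form used by the model may differ.

```python
import numpy as np

def joint_indicator(Z, A, k, lam):
    """Relaxed indicator as the k leading eigenvectors of Z Z^T - lam * L (assumed form)."""
    L = np.diag(A.sum(axis=1)) - A     # unnormalized Laplacian of the shared adjacency
    M = Z @ Z.T - lam * L              # k-means (Gram) term minus Ratio Cut term
    M = (M + M.T) / 2.0                # enforce exact symmetry before eigh
    _, vecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    return vecs[:, -k:]                # eigenvectors of the k largest eigenvalues

# Toy data: two pairs of linked nodes with nearly orthogonal embeddings.
Z = np.array([[1.0, 0.1],
              [0.9, 0.0],
              [0.1, 1.0],
              [0.0, 1.1]])
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
F = joint_indicator(Z, A, k=2, lam=0.5)
```

The returned `F` has orthonormal columns, so it satisfies the usual constraint of a relaxed indicator; discrete labels would then be obtained by grouping its rows, for instance with k-means.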
Update GAE with and fixed: When clustering related variables are fixed as constants, the objective function is equivalent to
which can be optimized by the standard gradient descent.
5 Motivation and Analysis
In this section, we will demonstrate the rationality of joint learning of GAE and the joint clustering from both intuitive aspect and theoretical aspect.
5.1 Intuitive Illustration
To understand the motivation of EGAE-JOCAS, we give a visual interpretation via experiments on a toy dataset. Roughly speaking, the GAE used in our model maps data into a non-linear subspace in which similarity is measured via inner-products. Figure 4 illustrates the effect of EGAE-JOCAS built on orthogonality on 2-rings data. In the experiment, links among points of the same cluster are set with probability 0.9. From the figure, we observe that as the number of iterations increases, 1) samples that are connected with each other become more cohesive; 2) samples of different clusters become as orthogonal as possible. In other words, we can obtain orthogonal projected points when the prior adjacency contains sufficient information.
In the next subsection, we will prove that the relaxed k-means obtains the optimal partition on orthogonal data, and thus it is vital to employ relaxed k-means as a part of the embedded clustering model.
5.2 Theoretical Analysis
When the embedding is fixed, the relaxed k-means is equivalent to
If the similarity is measured by the inner-product, which is non-negative, then can be regarded as the similarity among data points. It is quite natural to integrate the graph auto-encoder and spectral relaxed k-means, as EGAE-JOCAS attempts to project raw data into a latent feature space which measures similarity among data points via the inner-product. The following lemma and theorem show that the spectral relaxed k-means can be applied to this type of data.
Lemma 1 (due to space limitations, the proof is given in the supplementary material).
For any positive and symmetric matrix , the most principal component satisfies that all elements are nonzero and have the same sign. More formally,
Theorem 1. Provided that inner-products between any two data points are non-negative and two samples are orthogonal to each other if and only if they belong to different clusters, the spectral relaxed k-means can give an ideal partition.
Proof. If samples from different clusters are orthogonal and inner-products of samples from the same cluster are positive, then we have
where and consists of samples from the -th cluster. According to our assumption, we have . With the help of Lemma 2, there exists which satisfies that
Therefore, a valid relaxed can be given as
However, given an orthogonal matrix, any which satisfies is a valid solution. If we normalize rows of as , then we have
Accordingly, it will obtain ideal partition if the k-means is performed on rows of . ∎
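The statement can be checked numerically on a small example. The sketch assumes the standard spectral relaxation of k-means, in which the relaxed indicator consists of the leading eigenvectors of the Gram matrix of the data; the toy data `Z` below is constructed for illustration so that samples from different clusters are exactly orthogonal and samples from the same cluster have positive inner-products.

```python
import numpy as np

# Cluster 0 lives in dimensions 0-1, cluster 1 in dimensions 2-3 (toy data).
Z = np.array([[1.0, 0.2, 0.0, 0.0],
              [0.9, 0.3, 0.0, 0.0],
              [1.1, 0.1, 0.0, 0.0],
              [0.0, 0.0, 0.2, 1.0],
              [0.0, 0.0, 0.3, 0.8],
              [0.0, 0.0, 0.1, 1.2]])

K = Z @ Z.T                                  # block-diagonal, non-negative Gram matrix
_, vecs = np.linalg.eigh(K)                  # eigenvalues in ascending order
F = vecs[:, -2:]                             # relaxed indicator: 2 leading eigenvectors

# Normalize every row of F to unit length.
F_normalized = F / np.linalg.norm(F, axis=1, keepdims=True)

# Entry (i, j) is 1 when rows i and j coincide and 0 when they are orthogonal.
co_membership = F_normalized @ F_normalized.T
expected = np.kron(np.eye(2), np.ones((3, 3)))
```

Here `co_membership` matches `expected` up to rounding: after row normalization, samples of the same cluster collapse to a single point and different clusters are orthogonal, so k-means on the rows recovers the partition.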
The following theorem will show the connection between the relaxed k-means and normalized cut spectral clustering.
Theorem 2. Problem (16) is equivalent to spectral clustering with normalized cut if and only if the mean vector is perpendicular to the centralized data , or equivalently, is perpendicular to the affine space that all data points lie in.
Proof. The objective function of normalized cut spectral clustering is given as
Note that . Let be a centralized matrix. Then the centralized data can be represented as .
On the one hand, if is perpendicular to the centralized data, then we have
Accordingly, which means that data points lie in an affine space, and the centralized data is perpendicular to
Hence, the theorem is proved. ∎
Although the ideal condition seems too strict, EGAE-JOCAS has the ability to project points into a non-linear feature space in which data points assigned to different clusters are orthogonal (uncorrelated) to each other.
We state crucial details of experiments and clustering results of EGAE-JOCAS. The visualization is shown in Figure 6. In particular, we extend experiments into traditional datasets besides two graph type datasets.
6.1 Benchmark Datasets
Table 1 (columns: Dataset | # Nodes | # Links | # Features | # Classes)
To verify the effectiveness of EGAE-JOCAS, experiments are conducted not only on datasets with a prior adjacency but also on ordinary datasets for which the adjacency has to be constructed first.
Cora and Citeseer consist of adjacency and content. Each node represents a publication and each link denotes a citation between two papers. Content features correspond to a word vector. YALE (Georghiades et al., 2001) and UMIST (Hou et al., 2013) are traditional human face datasets, which have no adjacency. The information of the datasets is summarized in Table 1.
6.2 Baseline Methods
In total, ten algorithms are compared in the experiments: K-means, Spectral Clustering (SC), Graph Encoder (Tian et al., 2014b), Deep Walk (Perozzi et al., 2014), DNGR (Cao et al., 2016), GAE (VGAE) (Kipf & Welling, 2016), ARGE (ARVGE) (Pan et al., 2018), KKM (Dhillon et al., 2004), Auto-Encoder (Hinton & Salakhutdinov, 2006), and DEC (Xie et al., 2016). To test the effectiveness of different parts of EGAE-JOCAS, we conduct ablation experiments with and .
For Cora and Citeseer, only models designed for network embedding are used as competitors while for YALE and UMIST, several representative methods are compared.
6.3 Experimental Settings
In our experiments, the encoder is a two-layer GCN. Activation functions of the first layer and the second layer are linear function and ReLU, respectively. Although and are hyper-parameters to tune manually, it is pretty easy to set them. In our experiments, we set and empirically. Hence, there is only one tradeoff coefficient, , to tune. We search in the range of . The maximum inner-iteration is set as 10 and the maximum outer-iteration is set as 50. Besides, the learning rate is set as .
For Cora and Citeseer, the encoder is composed of a 32-neuron hidden layer and a 16-neuron embedding layer. As the adjacency is unweighted and discrete, is the cross-entropy loss function. For YALE and UMIST, the encoder has a 200-neuron hidden layer and a 50-neuron embedding layer. As no adjacency is provided a priori, the following formulation is employed to construct the adjacency
where and is the set of -nearest neighbors of the -th sample. Otherwise, . Then is computed by . Mean square error (MSE) is chosen as since the constructed adjacency is weighted and continuous. All methods are run 10 times and the mean results are reported. More details are given in the supplementary material.
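For datasets without a prior adjacency, a weighted k-nearest-neighbor graph can be built along the following lines. This is a generic sketch and not the formulation used above: the Gaussian weights, the bandwidth `sigma` and the symmetrization by averaging with the transpose are assumptions made for illustration.

```python
import numpy as np

def knn_adjacency(X, k, sigma=1.0):
    """Weighted, symmetric k-NN graph (Gaussian weights are an assumed choice)."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    P = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(sq[i])
        neighbors = [j for j in order if j != i][:k]   # k nearest samples, excluding i
        P[i, neighbors] = np.exp(-sq[i, neighbors] / (2.0 * sigma ** 2))
    return (P + P.T) / 2.0                             # symmetrize the directed graph

# Toy data: two well-separated pairs of points.
X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [5.0, 5.0],
              [5.1, 5.0]])
A = knn_adjacency(X, k=1)
```

The resulting adjacency is continuous-valued rather than binary, which is the situation in which a regression-type reconstruction loss such as MSE is the natural fit.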
6.4 Experimental Results
Graph type datasets:
Clustering results on Cora and Citeseer are reported in Tables 2 and 3, respectively. On both Cora and Citeseer, GAE and its extensions outperform other network embedding models. Specifically, GAE increases ACC by more than 10 percent compared with Deep Walk. Similar to the traditional auto-encoder, GAE suffers from overfitting. ARGE and ARVGE incorporate adversarial learning into GAE to address this issue and thus improve the performance. In particular, EGAE-JOCAS obtains remarkable results on all metrics with the same encoder structure. Compared with GAE, EGAE-JOCAS embeds joint clustering into GAE so that the extracted deep features are more appropriate for clustering.
Clustering results on YALE and UMIST are reported in Table 4. Since GAE suffers from overfitting, it performs worse than Auto-Encoder. In contrast, EGAE-JOCAS alleviates overfitting by adopting the designed joint clustering. Besides, it utilizes the structure information of the data and therefore achieves better results.