Clustering algorithms group similar samples together based on some predefined notion of similarity, often defined through kernels. However, the choice of an appropriate kernel is data-dependent; as a result, the kernel design process is frequently an art that requires intimate knowledge of the data. A common alternative is to simply use a general-purpose kernel that performs well under various conditions (e.g., polynomial or Gaussian kernels).
In this paper, we propose KernelNet (KNet), a methodology for learning a kernel, as well as an induced clustering, directly from the observed data. In particular, we train a deep kernel, combining a neural network representation with a Gaussian kernel. More specifically, given a dataset $X$ of $N$ samples in $\mathbb{R}^d$, we learn a kernel of the form:
$$\tilde{k}_{X,\theta}(x_i, x_j) = \frac{1}{\sqrt{d_i d_j}} \exp\left(-\frac{\|\psi_\theta(x_i) - \psi_\theta(x_j)\|_2^2}{2\sigma^2}\right),$$
where $\psi_\theta$ is an embedding function modeled as a neural network parametrized by $\theta$, and $\sigma$, $d_i$, $d_j$ are normalizing constants. Intuitively, by incorporating a neural network (NN) parameterization into a Gaussian kernel, we are able to learn a flexible deep kernel for clustering, tailored specifically to the given dataset.
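As a rough illustration of the kernel form described above, the following sketch evaluates a degree-normalized Gaussian kernel on the output of a small network. The two-layer `embed` function, its weights `W1` and `W2`, and the choice to compute the degrees on the embedded data are assumptions made for this example only, not the architecture used in the paper.

```python
import numpy as np

def embed(X, W1, W2):
    """Hypothetical two-layer embedding; (W1, W2) play the role of the parameters."""
    return np.tanh(X @ W1) @ W2

def normalized_gaussian_kernel(Z, sigma):
    """Gaussian kernel on the rows of Z, scaled by 1 / sqrt(d_i * d_j)."""
    sq_norms = np.sum(Z ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    d = K.sum(axis=1)  # degrees d_i (row sums of the Gaussian kernel)
    return K / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))    # 6 toy samples with 3 features
W1 = rng.normal(size=(3, 8))   # untrained, random weights
W2 = rng.normal(size=(8, 2))
K_tilde = normalized_gaussian_kernel(embed(X, W1, W2), sigma=1.0)
```

With trained rather than random weights, `K_tilde` would play the role of the learned similarity matrix.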
We train our deep kernel with a spectral clustering objective based on the Hilbert-Schmidt Independence Criterion (Gretton et al., 2005). Our training process can be interpreted as learning a non-linear transformation as well as its spectral embedding simultaneously. By appropriately selecting the initial values of our training process, we ensure that our clustering method is at least as powerful as spectral clustering. In particular, just like spectral clustering, our learned kernel and the induced clustering work exceptionally well on non-convex clusters. In practice, by training the kernel directly from the data, our proposed method significantly outperforms spectral clustering.
The non-linear embedding learned directly from the dataset allows us to readily handle out-of-sample data. Given a new sample not observed before, we can easily identify its cluster label by first computing its image under the learned embedding, effectively embedding it to the same space as the (already clustered) existing dataset. This is in contrast to spectral clustering, which would require a re-execution of the algorithm from scratch on the combined dataset of samples.
The aforementioned properties of our algorithm are illustrated in Fig. 1. A dataset of samples with non-convex spiral clusters is shown in Fig. 1(a). Applying a Gaussian kernel directly to these samples leads to a highly uninformative similarity matrix, as shown in Fig. 1(b). We train our embedding on only 1% of the samples, and apply it to the entire dataset; the dataset image, shown in Fig. 1(c), consists of nearly-convex, linearly-separable clusters. Most importantly, the corresponding learned kernel, shown in Fig. 1(d), yields a highly informative similarity matrix that clearly exhibits a block diagonal structure: this indicates high similarity within each cluster and low similarity across clusters.
In summary, we make the following contributions:
We propose a novel methodology for discovering a deep kernel tailored for clustering directly from data, using an objective based on the Hilbert-Schmidt Independence Criterion.
We propose an algorithm for training the kernel by maximizing this objective, as well as for selecting a good parameter initialization. Our algorithm, KNet, can be perceived as alternating between training the kernel and discovering its spectral embedding.
We evaluate the performance of KNet on synthetic and real data against multiple state-of-the-art methods for deep clustering. In 5 out of 6 datasets, KNet outperforms the state of the art by as much as 57.8%; this discrepancy is more pronounced in datasets with non-convex clusters, which KNet handles very well.
Finally, we demonstrate that the algorithm has exceptional performance in clustering out-of-sample data. We exploit this to show that KNet can be significantly accelerated through subsampling: an embedding learned on only 1%-35% of the data can be used to cluster an entire dataset, leading to only a 0%-3% degradation of clustering performance.
The remainder of the paper is organized as follows. In Sec. 2, we discuss related work. In Sec. 3, we provide a brief description of the Hilbert-Schmidt Independence Criterion and its relationship to clustering. The formal definition of the problem we solve and our proposed algorithm (KNet) are discussed in Sec. 4 and Sec. 5, respectively. Sec. 6 contains our experiments. Finally, we conclude in Sec. 7.
2 Related Work
Several recent works propose autoencoders specifically designed for clustering. Song et al. (2013) combine an autoencoder with k-means, including a penalty w.r.t. distance to cluster centers, obtained by alternating between stochastic gradient descent (SGD) and cluster center assignment. Ji et al. (2017) incorporate a subspace clustering penalty into an autoencoder, and alternate between SGD and dictionary learning. Tian et al. (2014) learn a stacked autoencoder initialized via a similarity matrix. Xie et al. (2016) incorporate a KL-divergence penalty between the encoding and a soft cluster assignment, both of which are again alternately optimized; a similar approach is followed by Guo et al. (2017) and Hu et al. (2017). In KNet, we significantly depart from these methods by using an HSIC-based objective, motivated by spectral clustering. In practice, this makes KNet better tailored to learning non-convex clusters, on which the aforementioned techniques perform poorly: we demonstrate this experimentally in Section 6.
Our work is closest to Shaham et al. (2018), who use an objective motivated by spectral clustering to learn an embedding highly correlated to a given similarity matrix; this makes the embedding only as good as the chosen similarity. The resulting optimization corresponds to Eq. (6) in our paper: KNet can thus be seen as a generalization of Shaham et al. (2018) that learns both the embedding and the similarity matrix jointly; this eschews the need for discovering the right similarity tailored to the dataset, and leads to overall improved performance (see also Sec. 6).
KNet also has a direct relationship to methods for kernel learning. A series of papers (Wilson et al., 2011, 2016b, 2016a) regress deep kernels to model Gaussian processes. Zhou et al. (2004) learn (shallow) linear combinations of given kernels. Closest to us, Niu et al. (2011) use HSIC to jointly discover a subspace in which data lives as well as its spectral embedding; the latter is used to cluster the data. This corresponds to learning a kernel over a (shallow) projection of the data to a linear subspace. KNet, therefore, generalizes the work by Niu et al. (2011) to learning a deep, non-linear kernel representation (c.f. Eq. (9)), which improves upon spectral embeddings and is used directly to cluster the data.
3 Hilbert-Schmidt Independence Criterion
Proposed by Gretton et al. (2005), the Hilbert-Schmidt Independence Criterion (HSIC) is a statistical dependence measure between two random variables. Like Mutual Information (MI), it measures dependence via the distance between the joint distribution of the two random variables and the product of their individual distributions. However, compared to MI, HSIC is easier to compute empirically, since it does not require a direct estimation of the joint distribution. Due to this advantage, it is used in many applications, including dimensionality reduction (Niu et al., 2011) and alternative clustering (Wu et al., 2018), to name a few.
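Formally, consider a set of $N$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$, $y_i$ are drawn from a joint distribution. Let $X$ and $Y$ be the corresponding matrices comprising a sample in each row. Let also $k_X$ be the Gaussian kernel $k_X(x_i, x_j) = \exp\left(-\|x_i - x_j\|_2^2 / (2\sigma^2)\right)$ and $k_Y$ be the linear kernel $k_Y(y_i, y_j) = y_i^\top y_j$. Define $K_X, K_Y \in \mathbb{R}^{N \times N}$ to be the kernel matrices with entries $K_{X,ij} = k_X(x_i, x_j)$ and $K_{Y,ij} = k_Y(y_i, y_j)$, respectively, and let $\tilde{K}_X$ be the normalized Gaussian kernel matrix given by
$$\tilde{K}_X = D^{-1/2} K_X D^{-1/2}, \tag{1}$$
where the degree matrix
$$D = \mathrm{diag}(K_X \mathbf{1}_N) \tag{2}$$
is a normalizing diagonal matrix. Then, the HSIC between $X$ and $Y$ is estimated empirically via:
$$\mathbb{H}(X, Y) = \frac{1}{(N-1)^2} \operatorname{tr}\left(\tilde{K}_X H K_Y H\right), \tag{3}$$
where $H = I - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top$ is a centering matrix.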
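The empirical estimate described above can be sketched in a few lines of NumPy. The sketch assumes the standard forms of the ingredients named in the text: a Gaussian kernel on the first variable normalized by its degree matrix, a linear kernel on the second variable, a centering matrix, and a 1/(N-1)^2 scaling. The function names are illustrative only.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix over the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def empirical_hsic(X, Y, sigma):
    """Empirical HSIC with a normalized Gaussian kernel on X and a linear kernel on Y."""
    n = X.shape[0]
    K_X = gaussian_kernel(X, sigma)
    d = K_X.sum(axis=1)                    # diagonal of the degree matrix
    K_X_norm = K_X / np.sqrt(np.outer(d, d))
    K_Y = Y @ Y.T                          # linear kernel
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return np.trace(K_X_norm @ H @ K_Y @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
hsic_random = empirical_hsic(X, rng.normal(size=(20, 3)), sigma=1.0)
hsic_constant = empirical_hsic(X, np.ones((20, 1)), sigma=1.0)  # Y carries no information
```

A constant `Y` is removed entirely by the centering step, so its estimate is zero up to round-off, which makes for a simple sanity check.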
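Intuitively, HSIC empirically measures the dependence between samples of the two random variables. Though HSIC can be more generally defined for arbitrary kernels, this particular choice has a direct relationship with (and motivation from) spectral clustering. In particular, given $X$, consider the optimization:
$$U_0 = \arg\max_{U:\, U^\top U = I} \mathbb{H}(X, U). \tag{4}$$
The optimal solution $U_0$ to (4) is the spectral embedding of $X$, whose columns are the top eigenvectors of the Laplacian of spectral clustering, given by (5). For completeness, we prove this in Appendix A.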
|the kernel function discovered by KNet|
|number of samples in a dataset|
|number of clusters|
|,||data samples in|
|,||data matrices in , where each row represents a single sample|
|Gaussian kernel parameter|
|,||kernel matrices computed on ,|
|normalized Gaussian kernel matrix, given by (1)|
|degree matrix, given by (2)|
|empirical estimate of HSIC, given by (3)|
|a vector of length $N$ containing 1s in all entries|
|the Laplacian of spectral clustering, given by (5)|
|spectral embedding of , obtained via eigendecomposition of or as solution to (4)|
|the weights of the encoder|
|the weights of the decoder|
|embedding for a single sample|
|embedding for the entire dataset|
|decoder portion of the autoencoder|
|autoencoder, given by (8)|
|spectral embedding of|
|pair ( and )|
|Objective of optimization problem (7)|
|the Laplacian of , given by (13)|
|-th diagonal element of degree matrix|
4 Problem Formulation
We are given a dataset of samples grouped in (potentially) non-convex clusters. Our objective is to cluster samples by first embedding them into a space in which the clusters become convex. Given such an embedding, clusters can subsequently be identified via, e.g., k-means. We would like the embedding, modeled as a neural network, to be at least as expressive as spectral clustering: clusters separable by spectral clustering should also become separable via our embedding. In addition, the embedding should generalize to out-of-sample data, thereby enabling us to cluster new samples outside the original dataset.
Learning the Embedding and a Deep Kernel. Formally, we wish to identify clusters over a dataset of samples and features. Let be an embedding of a sample to , modeled as a DNN parametrized by ; we denote by the image of under parameters . We also denote by the embedding of the entire dataset induced by , and use for the image of .
Let $U_0$ be the spectral embedding of $X$, obtained via spectral clustering. We can train $\psi_\theta$ to induce similar behavior as $U_0$ via the following optimization:
$$\max_{\theta}\; \mathbb{H}(\psi_\theta(X), U_0), \tag{6}$$
where $\mathbb{H}$ is given by Eq. (3). Intuitively, seeing HSIC as a dependence measure, by training $\theta$ so that $\psi_\theta(X)$ is maximally dependent on $U_0$, $\psi_\theta$ becomes a surrogate to the spectral embedding, sharing similar properties.
However, the surrogate $\psi_\theta$ learned via (6) may only be as discriminative as $U_0$. To address this issue, we depart from (6) by jointly discovering both $\psi_\theta$ as well as a coupled spectral embedding $U$. In particular, we solve the following optimization problem w.r.t. both the embedding and $U$:
$$\max_{\theta, U}\; \mathbb{H}(\psi_\theta(X), U) - \lambda\, \|X - f_\theta(X)\|_2^2 \tag{7a}$$
$$\text{s.t.}\quad U^\top U = I, \tag{7b}$$
where the unknowns are $\theta$ and $U$, $\lambda \geq 0$ weights the reconstruction error, and
$$f_\theta(X) = g(\psi_\theta(X)) \tag{8}$$
is an autoencoder, comprising $\psi$ and $g$ as an encoder and decoder respectively. The autoencoder ensures that the embedding is representative of the input $X$. To gain some intuition into how problem (7) generalizes (6), observe that if the embedding is fixed to be the identity map (i.e., $\psi_\theta(x) = x$ for all $x$) then, by Eq. (4), optimizing only for $U$ produces the spectral embedding $U_0$. The joint optimization of both $\psi_\theta$ and $U$ allows us to further improve upon $U_0$, as well as on the coupled $\psi_\theta$; we demonstrate experimentally in Section 6 that this significantly improves clustering quality.
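The optimization (7) can also be interpreted as an instance of kernel learning. Indeed, as discussed in the introduction, by learning $\psi_\theta$, we discover in effect a normalized kernel of the form
$$\tilde{k}_{X,\theta}(x_i, x_j) = \frac{1}{\sqrt{d_i d_j}} \exp\left(-\frac{\|\psi_\theta(x_i) - \psi_\theta(x_j)\|_2^2}{2\sigma^2}\right), \tag{9}$$
where $d_i$, $d_j$ are the corresponding diagonal elements of degree matrix $D$.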
Out-of-Sample Data. The embedding can readily be applied to clustering out-of-sample data. In particular, having trained the embedding over one dataset, we can cluster a new dataset efficiently as follows. First, we use the pre-trained embedding to map every sample in the new dataset to its image: this effectively embeds the new dataset to the same space as the original one. From this point on, clusters can be recomputed efficiently via, e.g., k-means, or by mapping the images to the closest existing cluster head. In contrast to, e.g., spectral clustering, this avoids recomputing the joint embedding of the entire dataset from scratch.
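The nearest-cluster-head variant of this procedure can be sketched as follows. The embedding `psi` below is a hypothetical stand-in (a fixed linear map) for a trained network, and the training labels are assumed to come from an earlier clustering run; both are placeholders for illustration.

```python
import numpy as np

def cluster_centers(Z, labels, n_clusters):
    """Mean of the embedded training samples in each cluster."""
    return np.stack([Z[labels == k].mean(axis=0) for k in range(n_clusters)])

def assign_out_of_sample(psi, X_new, centers):
    """Embed unseen samples with psi and pick the nearest existing center."""
    Z_new = psi(X_new)
    sq_dists = ((Z_new[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

# Hypothetical stand-in for a trained embedding (a fixed linear map).
psi = lambda X: X @ np.array([[2.0, 0.0], [0.0, 0.5]])

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
train_labels = np.array([0, 0, 1, 1])  # assumed output of a prior clustering
centers = cluster_centers(psi(X_train), train_labels, n_clusters=2)

X_new = np.array([[0.1, -0.1], [4.8, 5.2]])
new_labels = assign_out_of_sample(psi, X_new, centers)
```

No eigendecomposition or retraining is involved at this stage; the cost is one forward pass plus a distance computation per new sample.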
The ability to handle out-of-sample data can be leveraged to also accelerate training. In particular, given the original dataset , computation can be sped up by training the embedding by solving (7) on a small subset of . The resulting trained can be used to embed, and subsequently cluster, the entire dataset. We show in Section 6 that this approach works very well, leading to a significant acceleration in computations without degrading clustering quality.
where are the elements of matrix . The exponential terms in Eq. (10) compel pairs of samples with a positive weight to become attracted to each other, while pairs with a negative weight drift farther apart. This is illustrated in Figure 2: linearly separable, increasingly convex cluster images arise over several iterations of our algorithm for solving Eq. (7). The algorithm, KNet, is described in the next section.
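To see where pairwise weights of this kind can come from, the short derivation below expands the HSIC term for a fixed spectral embedding $U$ and a fixed degree matrix $D$. It assumes the empirical estimate of Eq. (3) with a linear kernel on $U$; the symbol $\Gamma$ is only an illustrative name for the resulting weight matrix, and constant factors are dropped, so the exact form of Eq. (10) may differ.

```latex
\begin{align*}
\operatorname{tr}\big(\tilde{K} H U U^\top H\big)
  &= \operatorname{tr}\big(D^{-1/2} K D^{-1/2} H U U^\top H\big) \\
  &= \operatorname{tr}\big(K\, \Gamma\big),
     \qquad \Gamma = D^{-1/2} H U U^\top H D^{-1/2} \\
  &= \sum_{i,j} \Gamma_{ij}
     \exp\!\Big(-\tfrac{\|\psi_\theta(x_i) - \psi_\theta(x_j)\|_2^2}{2\sigma^2}\Big).
\end{align*}
```

The second step uses the cyclic property of the trace, and the last step uses the symmetry of $\Gamma$. Since the sum is maximized, a positive $\Gamma_{ij}$ rewards a small distance between the two images, while a negative $\Gamma_{ij}$ rewards a large one.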
5 KNet Algorithm
We solve optimization problem (7) by iteratively adapting the network parameters and the spectral embedding. In particular, we initialize the embedding to be the identity map, and the spectral embedding variable to the spectral embedding of the input data. We subsequently alternate between adapting the two. We adapt the network parameters via stochastic gradient ascent (SGA). To optimize the spectral embedding, we adopt two approaches: one based on eigendecomposition, and one based on optimization over the Stiefel manifold. We describe each of these steps in detail below; a summary can be found in Algorithm 1.
Initialization. The non-convexity of (7) necessitates a principled approach for selecting good initialization points for both variables. We initialize the spectral embedding variable to the spectral embedding of the input data, computed via the top eigenvectors of its Laplacian, given by (5). We initialize the network parameters so that the embedding approximates the identity map; this is accomplished by pre-training the network via SGD as solutions to:
Note that, in this construction, we use .
Updating . A simple way to update is via a gradient ascent, i.e.:
for , where is the objective (7a). In practice, we wish to apply stochastic gradient ascent over mini-batches; with the spectral embedding fixed, objective (7a) reduces to (10); however, the terms in the sum are coupled via the normalizing degree matrix, which depends on the network parameters via (2). This significantly increases the cost of computing mini-batch gradients. To simplify this computation, we instead hold both the spectral embedding and the degree matrix fixed, and update the network parameters via one epoch of SGA over (7a). At the conclusion of one epoch, we update the Gaussian kernel and the corresponding degree matrix via Eq. (2). We implemented both this heuristic and regular SGA, and found that the heuristic led to a significant speedup without any observable degradation in clustering performance (see also Section 6).
Updating via Eigendecomposition. Our first approach to adapting relies on the fact that, holding constant, problem (7) reduces to the form (4). That is, at each iteration, for fixed, the optimal solution
is given by the top eigenvectors of Laplacian
Hence, given at an iteration, we update by returning . When , there are several efficient algorithms for computing the top eigenvectors–see, e.g., Fowlkes et al. (2004); Vladymyrov and Carreira-Perpiñán (2016).
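The eigendecomposition update can be sketched as below. The sketch assumes that the Laplacian of the embedded data takes the usual normalized form of spectral clustering, namely the Gaussian kernel of the embedded samples scaled on both sides by the inverse square root of the degree matrix; Eq. (13) may differ in detail (for instance by a centering step). It uses a dense solver rather than the fast approximate methods cited above.

```python
import numpy as np

def normalized_kernel(Z, sigma):
    """Assumed Laplacian: Gaussian kernel of Z scaled by D^{-1/2} on both sides."""
    sq = np.sum(Z ** 2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    d = K.sum(axis=1)
    return K / np.sqrt(np.outer(d, d))

def update_U_by_eig(Z, sigma, c):
    """Return the top-c eigenvectors (as columns) and their eigenvalues."""
    L = normalized_kernel(Z, sigma)
    eigvals, eigvecs = np.linalg.eigh(L)  # ascending order for symmetric input
    return eigvecs[:, ::-1][:, :c], eigvals[::-1][:c]

rng = np.random.default_rng(0)
# Two well-separated toy blobs standing in for the embedded dataset.
Z = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(4.0, 0.3, size=(10, 2))])
U, top_vals = update_U_by_eig(Z, sigma=1.0, c=2)
```

The returned columns are orthonormal, so the orthogonality constraint on the spectral embedding holds by construction.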
Updating via Stiefel Manifold Ascent. The manifold in defined by constraint (7b), a.k.a. the Stiefel Manifold, is not convex; nevertheless, there is an extensive literature on optimization techniques over this set (Absil et al., 2008; Wen and Yin, 2013), exploiting the fact that descent directions that maintain feasibility can be computed efficiently. In particular, following Wen and Yin (2013), treating as a constant, and given a feasible and the gradient of the objective w.r.t , define
Using and a predefined step length , the maximization proceeds iteratively via:
where is the so-called Cayley transform, defined as
The Cayley transform satisfies several important properties (Wen and Yin, 2013). First, starting from a feasible point, it maintains feasibility over the Stiefel manifold (7b) for all . Second, for small enough , it is guaranteed to follow an ascent direction; combined with line-search, convergence is guaranteed to a stationary point. Finally, given by (16) can be computed efficiently from , thereby avoiding a full matrix inversion, by using the Sherman-Morrison-Woodbury identity (Horn et al., 1990): this results in a complexity for (15), which is significantly faster than eigendecomposition when . In our second approach to updating , we apply (15) rather than eigendecomposition of when adapting iteratively. Both approaches are summarized in line 7 of Alg. 1; we refer to them as KNet and KNet, respectively, in our experiments in Sec. 6.
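A minimal sketch of one Cayley-transform ascent step is given below, following the general construction of Wen and Yin (2013) with the sign chosen for maximization. The exact sign and scaling conventions of Eqs. (14)-(16) are assumptions here, and the sketch solves a full linear system for clarity instead of applying the Sherman-Morrison-Woodbury shortcut. The matrix `M` is a random symmetric stand-in for the Laplacian.

```python
import numpy as np

def cayley_ascent_step(U, G, tau):
    """One ascent step on the Stiefel manifold {U : U^T U = I}."""
    n = U.shape[0]
    A = G @ U.T - U @ G.T  # skew-symmetric by construction
    I = np.eye(n)
    # U_new = (I - tau/2 A)^{-1} (I + tau/2 A) U, computed with a linear solve.
    return np.linalg.solve(I - 0.5 * tau * A, (I + 0.5 * tau * A) @ U)

def objective(U, M):
    return np.trace(U.T @ M @ U)

rng = np.random.default_rng(0)
B = rng.normal(size=(8, 8))
M = B @ B.T                                     # symmetric stand-in matrix
U0, _ = np.linalg.qr(rng.normal(size=(8, 2)))   # feasible start: U0^T U0 = I
G = 2.0 * M @ U0                                # gradient of tr(U^T M U) at U0
U1 = cayley_ascent_step(U0, G, tau=1e-3)
```

Because the Cayley transform of a skew-symmetric matrix is orthogonal, `U1` stays feasible for any step length; the step length only controls whether the objective actually increases.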
|HSIC||AE error||NMI||HSIC||AE error||NMI|
6 Experimental Evaluation
Datasets. The datasets we use are summarized in Table 2. The first three datasets (Moon, Spiral1, Spiral2) are synthetic and comprise non-convex clusters; they are shown in Figure 3. Among the remaining four real-life datasets, the features of Breast Cancer (Wolberg, 1992; Mangasarian, 1990) are discrete integer values between 0 and 10. The features of the Wine dataset (Dheeru and Karra Taniskidou, 2017) consist of a mix of real and integer values. The Reuters dataset (RCV) is a collection of news articles labeled by topic. We represent each article via a TF-IDF vector using the 500 most frequent words and apply PCA to further reduce the dimension. The Face dataset (Bay et al., 2000) consists of grey-scale images of 20 faces in different orientations. We also reduce its dimension via PCA. As a final preprocessing step, we center and scale all datasets so that their mean is 0 and the standard deviation of each feature is 1.
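The final centering and scaling step can be sketched as follows. The handling of constant features (left at zero rather than divided by a zero standard deviation) is an assumption of this example, since the text does not say how such features are treated.

```python
import numpy as np

def standardize(X):
    """Center each feature to mean 0 and scale it to standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)  # constant features stay at zero
    return (X - mean) / std

X = np.array([[1.0, 10.0, 3.0],
              [2.0, 20.0, 3.0],
              [3.0, 60.0, 3.0]])  # third feature is constant
X_std = standardize(X)
```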
|Moon||56.2 ± 0.0||42.2 ± 0.0||51.3 ± 20.3||100 ± 0.0||72 ± 0.0||66.1 ± 0.0||100 ± 0.0||100 ± 0.0|
|Spiral1||28.3 ± 0.0||32.02 ± 0.01||59.6 ± 7.5||100.0 ± 0.0||100.0 ± 0.0||42.0 ± 0.0||100.0 ± 0.0||100.0 ± 0.0|
|Cancer||79.9 ± 0.2||79.2 ± 0.0||74.6 ± 2.2||82.9 ± 0.0||69.8 ± 0.0||73 ± 0.0||84.2 ± 0.4||82.5 ± 0.1|
|Wine||54.6 ± 0.01||80.6 ± 0.0||72.3 ± 11.4||79.7 ± 0.2||88 ± 0.0||42.8 ± 0.0||91.0 ± 0.8||90.0 ± 0.7|
|RCV||39.3 ± 0.0||51.3 ± 0.0||39.0 ± 5.5||43.5 ± 0.2||46 ± 0.0||56 ± 0||46.3 ± 0.4||46.1 ± 0.2|
|Face||76.8 ± 0.0||75.8 ± 1.6||83.8 ± 3.5||75.6 ± 0.1||66.0 ± 0.4||91.8 ± 0.0||93 ± 0.3||92.6 ± 0.5|
Clustering Algorithms. We evaluate 8 algorithms, including our two versions of KNet described in Alg. 1. For existing algorithms, we use architecture designs (e.g., depth, width) as recommended by the respective authors during training. We provide details for each algorithm below.
k-means: We use the CUDA implementation (https://github.com/src-d/kmcuda) by Ding et al. (2015).
SC: We use the Python scikit implementation of the classic spectral clustering algorithm by Ng et al. (2002).
AEC: Proposed by Song et al. (2013), this algorithm incorporates a k-means objective in an autoencoder. As suggested by the authors, we use 3 hidden layers of width 1000, 250, and 50, respectively, with an output layer dimension of 10 (https://github.com/developfeng/DeepClustering).
DEC: Proposed by Xie et al. (2016), this algorithm couples an autoencoder with a soft cluster assignment via a KL-divergence penalty. As recommended, we use 3 hidden layers of width 500, 500, and 2000 with an output layer dimension of 10 (https://github.com/XifengGuo/DEC-keras).
IMSAT: Proposed by Hu et al. (2017), this algorithm trains a network adversarially by generating augmented datasets. It uses 2 hidden layers of width 1200 each; the output layer size equals the number of clusters (https://github.com/weihua916/imsat).
SN: Proposed by Shaham et al. (2018), SN uses an objective motivated by spectral clustering to map to a target similarity matrix (https://github.com/KlugerLab/SpectralNet).
KNet and KNet: These are the two versions of KNet, as described in Alg. 1, in which is updated via eigendecomposition and Stiefel Manifold Ascent, respectively. For both versions, the encoder and decoder have 3 layers. For the Cancer, Wine, RCV, and Face datasets, we set the width of all hidden layers to . For Moon and Spiral1, we set the width of all hidden layers to 20. We set the Gaussian kernel parameter to the median of the pairwise Euclidean distances between samples in each dataset (see Table 2).
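The median-distance rule for the Gaussian kernel parameter mentioned in the last item can be sketched as below. Taking the median over distinct pairs only (excluding the zero self-distances) is an assumption of this sketch; the text does not specify that detail.

```python
import numpy as np

def median_pairwise_distance(X):
    """Median Euclidean distance over all distinct pairs of rows of X."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    upper = np.triu_indices(X.shape[0], k=1)  # each unordered pair once
    return float(np.median(np.sqrt(sq_dists[upper])))

X = np.array([[0.0], [1.0], [3.0]])  # pairwise distances: 1, 3, 2
sigma = median_pairwise_distance(X)
```

For large datasets, the median could instead be estimated from a random subset of pairs to avoid forming the full distance matrix.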
Evaluation Metrics. We evaluate the clustering quality of each algorithm by comparing the clustering assignment generated to the ground truth assignment via the Normalized Mutual Information (NMI). NMI is a similarity metric lying in [0, 1], with 0 denoting no similarity and 1 denoting an identical match between the assignments. Originally recommended by Strehl and Ghosh (2002), this statistic has been widely used for clustering quality validation (Niu et al., 2011; Dang and Bailey, 2010; Wu et al., 2018; Ross and Dy, 2013). We provide a formal definition in Appendix A in the supplement.
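For readers without the supplement at hand, the following sketch computes NMI between two label assignments. It assumes the normalization by the geometric mean of the two entropies, as in Strehl and Ghosh (2002); the formal definition in the appendix may use a different normalization, and the treatment of single-cluster assignments is a convention chosen for this example.

```python
import math
from collections import Counter

def entropy(counts, n):
    """Shannon entropy of a label distribution given its counts."""
    return -sum((v / n) * math.log(v / n) for v in counts.values())

def nmi(labels_a, labels_b):
    """Mutual information divided by sqrt(H(a) * H(b))."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    h_a, h_b = entropy(count_a, n), entropy(count_b, n)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one assignment has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_info / math.sqrt(h_a * h_b)

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The score is invariant to how the clusters are named, which is why it suits comparisons against ground truth labels.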
For each algorithm, we also measure the execution time, separating it into preprocessing time (Prep) and runtime (RT); in doing so, we separately evaluate the cost of, e.g., parameter initialization from training.
Experimental Setup. We execute all algorithms on a machine with 16 dual-core CPUs (Intel Xeon E5-2630 v3 @ 2.40GHz), 32 GB of RAM, and an NVIDIA 1b80 GPU. For methods we can parallelize either over the GPU or over the 16 CPUs (IMSAT, SN, k-means, KNet), we ran both executions and recorded the fastest time. The code provided for DEC could only be parallelized over the GPU, while methods AEC and SC could only be executed over the CPUs. As SGA is randomized, for each dataset in Table 4 we run all algorithms on the full dataset 10 times and report the mean and standard deviation of the NMI of the resulting clustering against the ground truth.
For algorithms that can be executed out-of-sample (AEC, DEC, IMSAT, SN, KNet), we repeat the above experiment by training the embedding only on a random subset of the dataset. Subsequently, we apply the trained embedding to the entire dataset and cluster it via k-means. For comparison, we also find cluster centers via k-means on the subset and assign new samples to the nearest cluster center. For each dataset, we set the size of the subset (reported in Table 6) so that the spectrum of the resulting subset is close to the spectrum of the full dataset.
As clustering is unsupervised, we cannot rely on ground truth labels to identify the best value of the hyperparameter weighting the reconstruction error in the objective (7). We, therefore, need an unsupervised method for selecting this value. We find that, in practice, setting it to 0 works quite well. Because the problem is not convex, local optima reached by KNet depend highly on the initialization. Initializing (a) the network to approximate the identity map via (11), and (b) the spectral embedding variable to be the spectral embedding of the input indeed leads to a local maximum that is highly dependent on the input, eschewing the need for the reconstruction error in the objective (7); this phenomenon is also observed by Li et al. (2017). We also found that, as anticipated by (6), eliminating the second term produces more convex cluster images.
Table 3 shows the NMI on the Wine dataset for different values of this hyperparameter; tables for additional datasets can be found in Appendix A in the supplement. We also provide the HSIC and AE reconstruction error at convergence. Beyond the good performance of the value 0, the table suggests that an alternative unsupervised method is to select the hyperparameter so that the ratio of the two terms at convergence is close to one. Of course, this comes at the cost of parameter exploration; in the remainder of this section, we report the performance of KNet with the hyperparameter set to 0.
Comparison Against State-of-the-Art. Table 4 shows the NMI performance of different algorithms over different datasets. With the exception of the RCV dataset, we see that KNet outperforms every algorithm in Table 4. AEC, DEC, and IMSAT perform especially poorly when encountering non-convex clusters, as shown in the first two rows of the table. Spectral clustering (SC), and SN, which is also based on a spectral-clustering motivated objective, perform as well as KNet at discovering non-convex clusters. Nevertheless, KNet outperforms them for real datasets: e.g., for the Face dataset, KNet surpasses SN by 28%. Note that, for the RCV dataset, k-means outperformed all methods, though overall performance is quite poor; a reason for this may be the poor quality of features extracted via TF-IDF and PCA.
KNet’s ability to handle non-convex clusters is evident in the improvement from k-means to KNet for the first two datasets. The kernel matrix shown in Fig. 1(b) illustrates why k-means performs poorly on this dataset. In contrast, the increasingly convex cluster images learned by KNet, as shown in Fig. 1(c), lead to much better separability. This is consistently observed for both the Moon and Spiral1 datasets, for which KNet achieves 100% NMI; we elaborate on this further in Appendix A. These results demonstrate KNet’s ability to generate convex representations even when the initial representation is non-convex.
We also note that KNet consistently outperforms spectral clustering. This is worth noting because, as discussed in Sec. 4, KNet’s initialization of both of its variables is tied to the spectral embedding. Table 4 indicates that alternately learning both the kernel and the corresponding spectral embedding indeed leads to improved clustering performance.
Table 5 shows the time performance of each algorithm. In terms of total time, KNet is faster than AEC and DEC. We also observe that SN is faster than most algorithms in total time. However, SN is the most time intensive algorithm, since it requires extensive hyperparameter tuning to reach the reported NMI performance (see App. A). We note that a significant percentage of the total time for KNet is spent in the preprocessing step, with KNet being faster than KNet. This is due to the initialization of , i.e., training the corresponding autoencoder. Improving this initialization process could dramatically speed up the total runtime. Alternatively, as discussed in the next section, using only a small subset to train the embedding and clustering out-of-sample can also significantly accelerate the total runtime, without a considerable NMI degradation.
Out-of-Sample Clustering. We report out-of-sample clustering NMI performance in Table 6; note that SC cannot be executed out-of-sample. Each algorithm is trained using only a subset of samples, whose size is indicated in the table’s first column. Once trained, we report in the table the clustering quality of applying each algorithm to the full set without retraining. We observe that, with the exception of RCV, KNet clearly outperforms all benchmark algorithms in terms of clustering quality. This implies that KNet is capable of generalizing the results by using as little as 6% of the data.
By comparing Table 4 against Table 6, we see that AEC, DEC, and IMSAT suffer a significant quality reduction, while KNet suffers a maximum degradation of only 3%. Therefore, training on a small subset of the data not only yields high-quality results, but the results are also almost identical to those of training on the full set itself. Table 7, reporting corresponding times, indicates that this can also lead to a significant acceleration, especially of the preprocessing step. Together, these two observations indicate that KNet can indeed be applied to clustering of large non-convex datasets by training the embedding on only a small subset of the provided samples.
KNet performs unsupervised kernel discovery directly from data. By discovering a kernel that optimizes the spectral clustering objective, it simultaneously discovers an approximation of its embedding through a DNN. Furthermore, our experimental results confirm that KNet can be trained using only a subset of the data.