Deep Kernel Learning for Clustering

08/09/2019 · by Chieh Wu, et al.

We propose a deep learning approach for discovering kernels tailored to identifying clusters over sample data. Our neural network produces sample embeddings that are motivated by, and at least as expressive as, spectral clustering. Our training objective, based on the Hilbert-Schmidt Independence Criterion, can be optimized via gradient adaptations on the Stiefel manifold, leading to significant acceleration over spectral methods relying on eigendecompositions. Finally, our trained embedding can be directly applied to out-of-sample data. We show experimentally that our approach outperforms several state-of-the-art deep clustering methods, as well as traditional approaches such as k-means and spectral clustering, over a broad array of real-life and synthetic datasets.




1 Introduction

Clustering algorithms group similar samples together based on some predefined notion of similarity, often defined through kernels. However, the choice of an appropriate kernel is data-dependent; as a result, kernel design is frequently an art that requires intimate knowledge of the data. A common alternative is to simply use a general-purpose kernel that performs well under various conditions (e.g., polynomial or Gaussian kernels).

In this paper, we propose KernelNet (KNet), a methodology for learning a kernel, as well as an induced clustering, directly from the observed data. In particular, we train a deep kernel, combining a neural network representation with a Gaussian kernel. More specifically, given a dataset of $n$ samples in $\mathbb{R}^d$, we learn a kernel of the form:

$$k(x_i, x_j) = \frac{\exp\left(-\|f_\theta(x_i) - f_\theta(x_j)\|_2^2 / 2\sigma^2\right)}{\sqrt{d_i d_j}},$$

where $f_\theta$ is an embedding function modeled as a neural network parametrized by $\theta$, $\sigma > 0$ is a bandwidth parameter, and $d_i$, $d_j$ are normalizing constants. Intuitively, by incorporating a neural network (NN) parameterization into a Gaussian kernel, we are able to learn a flexible deep kernel for clustering, tailored specifically to the given dataset.

Figure 1: The effect of applying the learned embedding (trained on a small subset of the data) to the full Spiral dataset. The full dataset and its kernel matrix are shown in (a) and (b), respectively. The embedding of the entire dataset and its kernel matrix are shown in (c) and (d), respectively. The figure demonstrates KNet's ability to generate convex clusters with a block-diagonal kernel structure at a large scale while training only on a small subset of the data (in this particular case, 1%).

We train our deep kernel with a spectral clustering objective based on the Hilbert-Schmidt Independence Criterion (Gretton et al., 2005). Our training process can be interpreted as simultaneously learning a non-linear transformation as well as its spectral embedding. By appropriately selecting the initial values of our training process, we ensure that our clustering method is at least as powerful as spectral clustering. In particular, just like spectral clustering, our learned kernel and the induced clustering work exceptionally well on non-convex clusters. In practice, by training the kernel directly from the data, our proposed method significantly outperforms spectral clustering.

The non-linear embedding learned directly from the dataset allows us to readily handle out-of-sample data. Given a new sample not observed before, we can easily identify its cluster label by first computing its image under the learned embedding, effectively embedding it into the same space as the (already clustered) existing dataset. This is in contrast to spectral clustering, which would require re-executing the algorithm from scratch on the combined dataset.

The aforementioned properties of our algorithm are illustrated in Fig. 1. A dataset of samples with non-convex spiral clusters is shown in Fig. 1(a). Applying a Gaussian kernel directly to these samples leads to a highly uninformative similarity matrix, as shown in Fig. 1(b). We train our embedding on only 1% of the samples and apply it to the entire dataset; the dataset image, shown in Fig. 1(c), consists of nearly-convex, linearly-separable clusters. Most importantly, the corresponding learned kernel, shown in Fig. 1(d), yields a highly informative similarity matrix that clearly exhibits a block-diagonal structure: this indicates high similarity within each cluster and low similarity across clusters.

In summary, we make the following contributions:

  • We propose a novel methodology for discovering a deep kernel tailored for clustering directly from data, using an objective based on the Hilbert-Schmidt Independence Criterion.

  • We propose an algorithm for training the kernel by maximizing this objective, as well as for selecting a good parameter initialization. Our algorithm, KNet, can be perceived as alternating between training the kernel and discovering its spectral embedding.

  • We evaluate the performance of KNet on synthetic and real data against multiple state-of-the-art methods for deep clustering. On 5 out of 6 datasets, KNet outperforms the state of the art by as much as 57.8%; the discrepancy is more pronounced on datasets with non-convex clusters, which KNet handles very well.

  • Finally, we demonstrate that the algorithm has exceptional performance in clustering out-of-sample data. We exploit this to show that KNet can be significantly accelerated through subsampling: an embedding learned on only 1%-35% of the data can be used to cluster the entire dataset, with only a 0%-3% degradation in clustering performance.

The remainder of the paper is organized as follows. In Sec. 2, we discuss related work. In Sec. 3, we provide a brief description of the Hilbert-Schmidt Independence Criterion and its relationship to clustering. The formal definition of the problem we solve and our proposed algorithm (KNet) are discussed in Sec. 4 and Sec. 5, respectively. Sec. 6 contains our experiments. Finally, we conclude in Sec. 7.

2 Related Work

Several recent works propose autoencoders specifically designed for clustering.

Song et al. (2013) combine an autoencoder with k-means, adding a penalty on the distance to cluster centers, and optimize by alternating between stochastic gradient descent (SGD) and cluster center assignment.

Ji et al. (2017) incorporate a subspace clustering penalty into an autoencoder, and alternate between SGD and dictionary learning. Tian et al. (2014) learn a stacked autoencoder initialized via a similarity matrix. Xie et al. (2016) incorporate a KL-divergence penalty between the encoding and a soft cluster assignment, both of which are again alternately optimized; a similar approach is followed by Guo et al. (2017) and Hu et al. (2017). In KNet, we significantly depart from these methods by using an HSIC-based objective, motivated by spectral clustering. In practice, this makes KNet better tailored to learning non-convex clusters, on which the aforementioned techniques perform poorly: we demonstrate this experimentally in Section 6.

Our work is closest to Shaham et al. (2018), who use an objective motivated by spectral clustering to learn an embedding highly correlated with a given similarity matrix; this makes the embedding only as good as the chosen similarity. The resulting optimization corresponds to Eq. (6) in our paper: KNet can thus be seen as a generalization of Shaham et al. (2018) that learns both the embedding and the similarity matrix jointly; this eschews the need for discovering the right similarity tailored to the dataset, and leads to overall improved performance (see also Sec. 6).

KNet is also directly related to methods for kernel learning. A series of papers (Wilson et al., 2011, 2016b, 2016a) regress deep kernels to model Gaussian processes. Zhou et al. (2004) learn (shallow) linear combinations of given kernels. Closest to us, Niu et al. (2011) use HSIC to jointly discover a subspace in which the data lives as well as its spectral embedding; the latter is used to cluster the data. This corresponds to learning a kernel over a (shallow) projection of the data to a linear subspace. KNet therefore generalizes the work of Niu et al. (2011) to learning a deep, non-linear kernel representation (cf. Eq. (9)), which improves upon spectral embeddings and is used directly to cluster the data.

3 Hilbert-Schmidt Independence Criterion

Proposed by Gretton et al. (2005), the Hilbert-Schmidt Independence Criterion (HSIC) is a statistical dependence measure between two random variables. Like Mutual Information (MI), it measures dependence via the distance between the joint distribution of the two random variables and the product of their marginal distributions. However, compared to MI, HSIC is easier to compute empirically, since it does not require a direct estimation of the joint distribution. Owing to this advantage, it is used in many applications, including dimensionality reduction (Niu et al., 2011), feature selection (Song et al., 2007), and alternative clustering (Wu et al., 2018), to name a few.

Formally, consider a set of $n$ i.i.d. samples $(x_i, y_i)$, $i = 1, \ldots, n$, where $x_i$, $y_i$ are drawn from a joint distribution. Let $X$ and $Y$ be the corresponding matrices comprising a sample in each row. Let also $k_X$ be the Gaussian kernel $k_X(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / 2\sigma^2)$ and $k_Y$ be the linear kernel $k_Y(y_i, y_j) = y_i^\top y_j$. Define $G_X, K_Y$ to be the kernel matrices with entries $[G_X]_{ij} = k_X(x_i, x_j)$ and $[K_Y]_{ij} = k_Y(y_i, y_j)$, respectively, and let $\bar{G}_X$ be the normalized Gaussian kernel matrix given by

$$\bar{G}_X = D^{-1/2} G_X D^{-1/2}, \qquad (1)$$

where the degree matrix

$$D = \mathrm{diag}(G_X \mathbf{1}_n) \qquad (2)$$

is a normalizing diagonal matrix. Then, the HSIC between $X$ and $Y$ is estimated empirically via:

$$\mathrm{HSIC}(X, Y) = \frac{1}{(n-1)^2} \, \mathrm{tr}\!\left(\bar{G}_X H K_Y H\right), \qquad (3)$$

where $H = I - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$ is a centering matrix.
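As a concrete illustration, the empirical estimator above can be computed in a few lines. The following is a minimal numpy sketch (function names are ours), using a normalized Gaussian kernel on $X$ and a linear kernel on $Y$:

```python
import numpy as np

def normalized_gaussian_kernel(X, sigma):
    # G_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), normalized as D^{-1/2} G D^{-1/2}
    sq = np.sum(X ** 2, axis=1)
    G = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
    d = G.sum(axis=1)                       # diagonal of the degree matrix D
    return G / np.sqrt(np.outer(d, d))

def hsic(X, Y, sigma=1.0):
    # Empirical HSIC: (1/(n-1)^2) tr(G_bar H K_Y H), with K_Y the linear kernel
    n = X.shape[0]
    G_bar = normalized_gaussian_kernel(X, sigma)
    K_Y = Y @ Y.T                           # linear kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return np.trace(G_bar @ H @ K_Y @ H) / (n - 1) ** 2
```

As a sanity check, strongly dependent pairs (e.g., `Y = X`) score much higher than independently drawn pairs.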

Intuitively, HSIC empirically measures the dependence between samples of the two random variables. Though HSIC can be more generally defined for arbitrary kernels, this particular choice has a direct relationship with (and motivation from) spectral clustering. In particular, given $X$, consider the optimization:

$$\max_{U} \ \mathrm{HSIC}(X, U) \qquad (4a)$$
$$\text{subject to} \ U^\top U = I, \qquad (4b)$$

where $\mathrm{HSIC}$ is given by (3). Then, the optimal solution $U^*$ to (4) is precisely the spectral embedding of $X$ (Niu et al., 2011). Indeed, $U^*$ comprises the top $k$ eigenvectors of the Laplacian of $X$, given by:

$$L = H \bar{G}_X H. \qquad (5)$$

For completeness, we prove this in Appendix A.
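The spectral embedding solving (4) can therefore be computed directly from the top-$k$ eigenvectors of the Laplacian; a self-contained numpy sketch (symbol and function names are ours):

```python
import numpy as np

def spectral_embedding(X, k, sigma=1.0):
    # Solve max_U tr(U^T L U) s.t. U^T U = I, where L = H G_bar H:
    # the optimum is the matrix of top-k eigenvectors of L.
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    G = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
    d = G.sum(axis=1)
    G_bar = G / np.sqrt(np.outer(d, d))     # normalized Gaussian kernel
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    L = H @ G_bar @ H                       # Laplacian of Eq. (5)
    _, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    return vecs[:, -k:]                     # top-k eigenvectors
```

Note that `np.linalg.eigh` returns orthonormal eigenvectors, so the returned embedding satisfies the constraint (4b) by construction.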

$k(\cdot,\cdot)$: the kernel function discovered by KNet
$n$: number of samples in a dataset
$k$: number of clusters
$d$: sample dimension
$x_i$, $y_i$: data samples in $\mathbb{R}^d$
$X$, $Y$: data matrices in $\mathbb{R}^{n \times d}$, where each row represents a single sample
$k_X$: Gaussian kernel
$k_Y$: linear kernel
$\sigma$: Gaussian kernel parameter
$G_X$, $K_Y$: kernel matrices computed on $X$, $Y$
$\bar{G}_X$: normalized Gaussian kernel matrix, given by (1)
$D$: degree matrix, given by (2)
$\mathrm{HSIC}(X, Y)$: empirical estimate of HSIC, given by (3)
$H$: centering matrix
$\mathbf{1}_n$: a vector of length $n$ containing 1s in all entries
$L$: the Laplacian of spectral clustering, given by (5)
$U$: spectral embedding of $X$, obtained via eigendecomposition of $L$ or as solution to (4)
$\theta$: the weights of the encoder
$\theta_d$: the weights of the decoder
$f_\theta(x)$: embedding for a single sample
$F_\theta(X)$: embedding for the entire dataset
$g_{\theta_d}$: decoder portion of the autoencoder
$\psi(X)$: autoencoder, given by (8)
$\tilde{U}$: spectral embedding of $F_\theta(X)$
$W$: the pair ($\theta$ and $U$)
$J(W)$: objective of optimization problem (7)
$\tilde{L}$: the Laplacian of $F_\theta(X)$, given by (13)
$d_i$: $i$-th diagonal element of degree matrix $D$
Table 1: Notation Summary.

4 Problem Formulation

We are given a dataset of samples grouped in (potentially) non-convex clusters. Our objective is to cluster the samples by first embedding them into a space in which the clusters become convex. Given such an embedding, clusters can subsequently be identified via, e.g., k-means. We would like the embedding, modeled as a neural network, to be at least as expressive as spectral clustering: clusters separable by spectral clustering should also become separable via our embedding. In addition, the embedding should generalize to out-of-sample data, thereby enabling us to cluster new samples outside the original dataset.

Learning the Embedding and a Deep Kernel. Formally, we wish to identify $k$ clusters over a dataset $X$ of $n$ samples and $d$ features. Let $f_\theta$ be an embedding of a sample to a lower-dimensional space, modeled as a DNN parametrized by $\theta$; we denote by $f_\theta(x)$ the image of $x$ under parameters $\theta$. We also denote by $F_\theta$ the embedding of the entire dataset induced by $f_\theta$, and use $F_\theta(X)$ for the image of $X$.

Let $U$ be the spectral embedding of $X$, obtained via spectral clustering. We can train $f_\theta$ to induce similar behavior as $U$ via the following optimization:

$$\max_\theta \ \mathrm{HSIC}(F_\theta(X), U), \qquad (6)$$

where $\mathrm{HSIC}$ is given by Eq. (3). Intuitively, seeing HSIC as a dependence measure, by training $\theta$ so that $F_\theta(X)$ is maximally dependent on $U$, $F_\theta$ becomes a surrogate to the spectral embedding, sharing similar properties.

However, the surrogate learned via (6) may only be as discriminative as $U$. To address this issue, we depart from (6) by jointly discovering both $F_\theta$ as well as a coupled spectral embedding $U$. In particular, we solve the following optimization problem w.r.t. both the embedding and $U$:

$$\max_{\theta, U} \ \mathrm{HSIC}(F_\theta(X), U) - \lambda \|\psi(X) - X\|_F^2 \qquad (7a)$$
$$\text{subj. to:} \ U^\top U = I, \qquad (7b)$$

where the unknowns are $\theta$ and $U$, and

$$\psi(X) = g_{\theta_d}(F_\theta(X)) \qquad (8)$$

is an autoencoder, comprising $F_\theta$ and $g_{\theta_d}$ as an encoder and decoder, respectively. The autoencoder ensures that the embedding is representative of the input $X$. To gain some intuition into how problem (7) generalizes (6), observe that if the embedding is fixed to be the identity map (i.e., $F_\theta(X) = X$), then, by Eq. (4), optimizing only for $U$ produces the spectral embedding of $X$. The joint optimization of both $\theta$ and $U$ allows us to further improve upon $U$, as well as on the coupled $F_\theta$; we demonstrate experimentally in Section 6 that this significantly improves clustering quality.

The optimization (7) can also be interpreted as an instance of kernel learning. Indeed, as discussed in the introduction, by learning $f_\theta$, we discover in effect a normalized kernel of the form

$$k(x_i, x_j) = \frac{\exp\left(-\|f_\theta(x_i) - f_\theta(x_j)\|_2^2 / 2\sigma^2\right)}{\sqrt{d_i d_j}}, \qquad (9)$$

where $d_i$, $d_j$ are the corresponding diagonal elements of degree matrix $D$.
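To make the construction concrete, the kernel matrix induced by any trained embedding can be evaluated directly; in this sketch, `f` stands in for the trained network and `sigma` for the bandwidth (both assumptions of ours, not fixed by the text):

```python
import numpy as np

def deep_kernel_matrix(f, X, sigma=1.0):
    # k(x_i, x_j) = exp(-||f(x_i) - f(x_j)||^2 / (2 sigma^2)) / sqrt(d_i d_j),
    # where d_i are diagonal entries of the degree matrix of the unnormalized kernel.
    Z = f(X)                                # embed all samples
    sq = np.sum(Z ** 2, axis=1)
    G = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / (2 * sigma ** 2))
    d = G.sum(axis=1)
    return G / np.sqrt(np.outer(d, d))
```

With a well-trained embedding, plotting this matrix with rows grouped by cluster exhibits the block-diagonal structure of Fig. 1(d).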

Out-of-Sample Data. The embedding $f_\theta$ can readily be applied to clustering out-of-sample data. In particular, having trained $f_\theta$ over dataset $X$, given a new dataset $X'$, we can cluster the new dataset efficiently as follows. First, we use the pre-trained $f_\theta$ to map every sample in $X'$ to its image, producing $F_\theta(X')$: this effectively embeds $X'$ into the same space as $F_\theta(X)$. From this point on, clusters can be recomputed efficiently via, e.g., k-means, or by mapping the images to the closest existing cluster head. In contrast to, e.g., spectral clustering, this avoids recomputing the joint embedding of the entire dataset from scratch.

The ability to handle out-of-sample data can be leveraged to also accelerate training. In particular, given the original dataset $X$, computation can be sped up by training the embedding, i.e., solving (7), on only a small subset of $X$. The resulting trained $f_\theta$ can be used to embed, and subsequently cluster, the entire dataset. We show in Section 6 that this approach works very well, leading to a significant acceleration in computation without degrading clustering quality.
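The out-of-sample step amounts to embedding the new samples and assigning each to the nearest existing cluster center. A minimal sketch (names are ours; `f` is the trained embedding, `centers` the cluster centers found by k-means in embedding space):

```python
import numpy as np

def cluster_out_of_sample(f, X_new, centers):
    # Embed unseen samples with the trained map f, then label each by its
    # nearest cluster center in the embedding space; no retraining needed.
    Z = f(X_new)
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

The cost is one forward pass plus a nearest-center search, versus a full eigendecomposition for re-running spectral clustering on the combined data.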

Convex Cluster Images. The first term in objective (7a) naturally encourages $F_\theta$ to form convex clusters. To see this, observe that, ignoring the reconstruction error, the objective (7a) becomes (up to a constant factor):

$$\sum_{i,j} w_{ij} \, \frac{\exp\left(-\|f_\theta(x_i) - f_\theta(x_j)\|_2^2 / 2\sigma^2\right)}{\sqrt{d_i d_j}}, \qquad (10)$$

where $w_{ij}$ are the elements of matrix $H U U^\top H$. The exponential terms in Eq. (10) compel samples for which $w_{ij} > 0$ to become attracted to each other, while samples for which $w_{ij} < 0$ drift farther apart. This is illustrated in Figure 2: linearly separable, increasingly convex cluster images arise over several iterations of our algorithm solving Eq. (7). The algorithm, KNet, is described in the next section.

Figure 2: Cluster images after several epochs of stochastic gradient descent over (7a). The first term of objective (7a) pulls cluster images further apart, while making them increasingly convex. Note that this happens in a fully unsupervised fashion.

5 KNet Algorithm

We solve optimization problem (7) by iteratively adapting $\theta$ and $U$. In particular, we initialize $F_\theta$ to be (approximately) the identity map, and $U$ to be the spectral embedding of $X$. We subsequently alternate between adapting $\theta$ and $U$: we adapt $\theta$ via stochastic gradient ascent (SGA), while to optimize $U$ we adopt two approaches, one based on eigendecomposition and one based on optimization over the Stiefel manifold. We describe each of these steps in detail below; a summary can be found in Algorithm 1.

Initialization. The non-convexity of (7) necessitates a principled approach for selecting good initialization points for $\theta$ and $U$. We initialize $U$ to the spectral embedding of $X$, computed via the top-$k$ eigenvectors of the Laplacian of $X$, given by (5). We initialize $\theta$ so that $F_\theta$ is approximately the identity map; this is accomplished by pre-training the encoder and decoder weights via SGD as solutions to:

$$\min_{\theta, \theta_d} \ \|F_\theta(X) - X\|_F^2 + \|\psi(X) - X\|_F^2. \qquad (11)$$

Note that, in this construction, the embedding dimension must equal the input dimension $d$.

1:  Input: data $X$
2:  Output: $\theta$ and clustering labels
3:  Initialization: initialize $\theta$ via (11); initialize $U$ with the spectral embedding of $X$
4:  repeat
5:     Update $\theta$ via one epoch of SGA over (7a), holding $U$ and $D$ fixed
6:     Update $D$ via Eq. (2)
7:     Update $U$ via eigendecomposition of Laplacian (13) (eigendecomposition variant), or via Stiefel Manifold Ascent (15) (Stiefel variant)
8:  until $U$ has converged
9:  Run k-means on $F_\theta(X)$
Algorithm 1 KNet Algorithm

Updating $\theta$. A simple way to update $\theta$ is via gradient ascent, i.e.:

$$\theta \leftarrow \theta + \eta \nabla_\theta J(\theta, U), \qquad (12)$$

for some step size $\eta > 0$, where $J$ is the objective (7a). In practice, we wish to apply stochastic gradient ascent over mini-batches; for $U$ fixed, objective (7a) reduces to (10); however, the terms in the sum are coupled via the normalizing degree matrix $D$, which depends on $\theta$ via (2). This significantly increases the cost of computing mini-batch gradients. To simplify this computation, we instead hold both $U$ and $D$ fixed, and update $\theta$ via one epoch of SGA over (7a). At the conclusion of the epoch, we update the Gaussian kernel and the corresponding degree matrix via Eq. (2). We implemented both this heuristic and regular SGA, and found that the heuristic led to a significant speedup without any observable degradation in clustering performance (see also Section 6).


Updating $U$ via Eigendecomposition. Our first approach to adapting $U$ relies on the fact that, holding $\theta$ constant, problem (7) reduces to the form (4). That is, at each iteration, for $\theta$ fixed, the optimal solution $U^*$ is given by the top $k$ eigenvectors of the Laplacian

$$\tilde{L} = H \bar{G}_{F_\theta(X)} H. \qquad (13)$$

Hence, given $\theta$ at an iteration, we update $U$ by returning $U^*$. When $k \ll n$, there are several efficient algorithms for computing the top $k$ eigenvectors; see, e.g., Fowlkes et al. (2004); Vladymyrov and Carreira-Perpiñán (2016).

Updating $U$ via Stiefel Manifold Ascent. The manifold defined by constraint (7b), a.k.a. the Stiefel manifold, is not convex; nevertheless, there is an extensive literature on optimization techniques over this set (Absil et al., 2008; Wen and Yin, 2013), exploiting the fact that descent directions that maintain feasibility can be computed efficiently. In particular, following Wen and Yin (2013), treating $\theta$ as a constant, and given a feasible $U$ and the gradient $G = \nabla_U J$ of the objective w.r.t. $U$, define

$$A = U G^\top - G U^\top. \qquad (14)$$

Using $A$ and a predefined step length $\tau$, the maximization proceeds iteratively via:

$$U \leftarrow Q(\tau) U, \qquad (15)$$

where $Q(\tau)$ is the so-called Cayley transform, defined as

$$Q(\tau) = \left(I + \tfrac{\tau}{2} A\right)^{-1} \left(I - \tfrac{\tau}{2} A\right). \qquad (16)$$

The Cayley transform satisfies several important properties (Wen and Yin, 2013). First, starting from a feasible point, it maintains feasibility over the Stiefel manifold (7b) for all $\tau$. Second, for small enough $\tau$, it is guaranteed to follow an ascent direction; combined with line search, convergence to a stationary point is guaranteed. Finally, $Q(\tau)$ in (16) can be computed efficiently, avoiding a full matrix inversion, by using the Sherman-Morrison-Woodbury identity (Horn et al., 1990): this reduces the update (15) to solving a much smaller system, which is significantly faster than eigendecomposition when $k \ll n$. In our second approach to updating $U$, we apply (15) rather than eigendecomposition of (13) when adapting $U$ iteratively. Both approaches are summarized in line 7 of Alg. 1; we refer to them as the eigendecomposition and Stiefel variants of KNet, respectively, in our experiments in Sec. 6.
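The Cayley-transform update is short to implement. This sketch forms the skew-symmetric matrix $A = U G^\top - G U^\top$ for an ascent step and solves the full system directly (the Sherman-Morrison-Woodbury speedup is omitted for clarity; function names are ours):

```python
import numpy as np

def cayley_ascent_step(U, G, tau):
    # One feasible ascent step on the Stiefel manifold {U : U^T U = I}.
    # G is the Euclidean gradient of the objective at U; since A is
    # skew-symmetric, the Cayley transform preserves U^T U = I for any tau.
    n = U.shape[0]
    A = U @ G.T - G @ U.T
    I = np.eye(n)
    return np.linalg.solve(I + (tau / 2) * A, (I - (tau / 2) * A) @ U)
```

For a small step length, the objective increases while the iterate stays exactly feasible, which is what makes this update attractive compared to projecting after an unconstrained gradient step.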

Dataset    n      k   d   Data Type        σ
Moon       1000   2   2   Geometric Shape  0.1701
Spiral1    3000   3   2   Geometric Shape  0.1708
Spiral2    30000  3   2   Geometric Shape  0.1811
Cancer     683    2   9   Medical          3.3194
Wine       178    3   13  Classification   4.939
RCV        10000  4   5   Text             2.364
Face       624    20  27  Image            6.883
Table 2: Dataset Summary.
λ | KNet (eigendecomposition): HSIC, AE error, NMI | KNet (Stiefel): HSIC, AE error, NMI
98.11 21.582 0.9 92.01 8.98 0.89
98.805 72.952 0.88 94.211 72.619 0.87
112.321 101.449 0.87 101.11 88.393 0.89
0.005 109.008 105.369 0.908 108.222 107.221 0.903
113.178 124.56 0.91 110.182 126.63 0.91
0 110.895 127.23 0.924 109.363 127.11 0.908
Table 3: HSIC, AE reconstruction error, and NMI at convergence of KNet on the Wine dataset, as a function of $\lambda$.

6 Experimental Evaluation

Datasets. The datasets we use are summarized in Table 2. The first three datasets (Moon, Spiral1, Spiral2) are synthetic and comprise non-convex clusters; they are shown in Figure 3. Among the remaining four real-life datasets, the features of Breast Cancer (Wolberg, 1992; Mangasarian, 1990) are discrete integer values between 0 and 10. The features of the Wine dataset (Dheeru and Karra Taniskidou, 2017) consist of a mix of real and integer values. The Reuters dataset (RCV) is a collection of news articles labeled by topic. We represent each article via a TFIDF vector using the 500 most frequent words and apply PCA to further reduce the dimension to 5. The Face dataset (Bay et al., 2000) consists of grayscale images of 20 faces in different orientations. We reduce the dimension to 27 via PCA. As a final preprocessing step, we center and scale all datasets so that their mean is 0 and the standard deviation of each feature is 1.

Figure 3: Synthetic Datasets. Both datasets contain non-convex clusters. Dataset Spiral2 (depicted) contains 30,000 samples, while Spiral1 contains a sub-sampled version with 3,000 samples.
Dataset | AEC | DEC | IMSAT | SN | SC | k-means | KNet (eig.) | KNet (Stiefel)
Moon | 56.2 ± 0.0 | 42.2 ± 0.0 | 51.3 ± 20.3 | 100 ± 0.0 | 72 ± 0.0 | 66.1 ± 0.0 | 100 ± 0.0 | 100 ± 0.0
Spiral1 | 28.3 ± 0.0 | 32.02 ± 0.01 | 59.6 ± 7.5 | 100.0 ± 0.0 | 100.0 ± 0.0 | 42.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0
Cancer | 79.9 ± 0.2 | 79.2 ± 0.0 | 74.6 ± 2.2 | 82.9 ± 0.0 | 69.8 ± 0.0 | 73 ± 0.0 | 84.2 ± 0.4 | 82.5 ± 0.1
Wine | 54.6 ± 0.01 | 80.6 ± 0.0 | 72.3 ± 11.4 | 79.7 ± 0.2 | 88 ± 0.0 | 42.8 ± 0.0 | 91.0 ± 0.8 | 90.0 ± 0.7
RCV | 39.3 ± 0.0 | 51.3 ± 0.0 | 39.0 ± 5.5 | 43.5 ± 0.2 | 46 ± 0.0 | 56 ± 0 | 46.3 ± 0.4 | 46.1 ± 0.2
Face | 76.8 ± 0.0 | 75.8 ± 1.6 | 83.8 ± 3.5 | 75.6 ± 0.1 | 66.0 ± 0.4 | 91.8 ± 0.0 | 93 ± 0.3 | 92.6 ± 0.5
Table 4: Clustering results measured by NMI (as percentages); the best mean results are highlighted in bold. Except on the RCV dataset, KNet generally outperforms competing methods by a significant margin. The improvement is especially large on the Moon and Spiral1 datasets, due to KNet's ability to identify non-convex clusters.
Dataset Prep RT Prep RT Prep RT RT RT RT Prep RT RT
Moon 18.23s 28.04s 5.53m 3.52s 1.2s 34.34s 5.4m 0.34s 0.043s 129s 28.9s 18.4s
Spiral1 42.9m 2.2 h 5.73m 2.02m 53.06s 5.15m 7.4m 4.2s 0.12s 5.0m 42s 19s
Cancer 1.65m 4.67m 5.7m 38.73s 1.1s 2.4m 3.9m 0.18s 0.03s 150s 19.3s 10.3s
Wine 33.81s 41.69s 5.75m 38.02s 1.1s 33.32s 5.5m 0.03s 0.06s 7.7m 7.2s 3.4s
RCV 2.36m 42.28m 6.35m 13.07m 10.39s 1.22m 7.3m 83.4s 0.35s 18m 30.5m 18.6m
Face 27.5s 3.15m 5.74m 4.235m 1.2s 2.03m 170.9s 0.26s 0.15s 22m 20.9s 3.3s
Table 5: The preprocessing (Prep) and runtime (RT) for all benchmark algorithms are displayed in seconds (s), minutes (m) and hours (h). The table demonstrates that KNet’s speed is comparable to competing methods.

Clustering Algorithms. We evaluate 8 algorithms, including our two versions of KNet described in Alg. 1. For existing algorithms, we use architecture designs (e.g., depth, width) as recommended by the respective authors during training. We provide details for each algorithm below.

k-means: We use the CUDA implementation by Ding et al. (2015).
SC: We use the Python scikit-learn implementation of the classic spectral clustering algorithm by Ng et al. (2002).
AEC: Proposed by Song et al. (2013), this algorithm incorporates a k-means objective in an autoencoder. As suggested by the authors, we use 3 hidden layers of width 1000, 250, and 50, respectively, with an output layer dimension of 10.
DEC: Proposed by Xie et al. (2016), this algorithm couples an autoencoder with a soft cluster assignment via a KL-divergence penalty. As recommended, we use 3 hidden layers of width 500, 500, and 2000, with an output layer dimension of 10.
IMSAT: Proposed by Hu et al. (2017), this algorithm trains a network adversarially by generating augmented datasets. It uses 2 hidden layers of width 1200 each; the output layer size equals the number of clusters $k$.
SN: Proposed by Shaham et al. (2018), SN uses an objective motivated by spectral clustering to map to a target similarity matrix.
KNet (eigendecomposition) and KNet (Stiefel): These are the two versions of KNet, as described in Alg. 1, in which $U$ is updated via eigendecomposition and Stiefel Manifold Ascent, respectively. For both versions, the encoder and decoder have 3 layers. For the Cancer, Wine, RCV, and Face datasets, we set the width of all hidden layers to the input dimension $d$; for Moon and Spiral1, we set the width of all hidden layers to 20. We set the Gaussian kernel bandwidth $\sigma$ to the median of the pairwise Euclidean distances between samples in each dataset (see Table 2).

Evaluation Metrics. We evaluate the clustering quality of each algorithm by comparing the generated clustering assignment to the ground-truth assignment via the Normalized Mutual Information (NMI). NMI is a similarity metric lying in $[0, 1]$, with 0 denoting no similarity and 1 an identical match between the assignments. Originally recommended by Strehl and Ghosh (2002), this statistic has been widely used for clustering quality validation (Niu et al., 2011; Dang and Bailey, 2010; Wu et al., 2018; Ross and Dy, 2013). We provide a formal definition in Appendix A in the supplement.
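For reference, NMI can be computed directly from a contingency table. A self-contained sketch (assuming natural-log entropies and the geometric-mean normalization; the function name is ours):

```python
import numpy as np

def nmi(labels_a, labels_b):
    # Normalized Mutual Information: I(A;B) / sqrt(H(A) H(B)).
    # Equals 1 for identical partitions (up to relabeling), 0 for independent ones.
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = a.size
    ca, cb = np.unique(a), np.unique(b)
    # joint distribution of the two assignments (contingency table / n)
    P = np.array([[np.sum((a == i) & (b == j)) for j in cb] for i in ca]) / n
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / np.outer(pa, pb)[nz]))
    h_a, h_b = -np.sum(pa * np.log(pa)), -np.sum(pb * np.log(pb))
    return mi / np.sqrt(h_a * h_b) if h_a > 0 and h_b > 0 else 0.0
```

Note that NMI is invariant to relabeling clusters, which is why it is suitable for comparing an unsupervised assignment against ground truth.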

For each algorithm, we also measure the execution time, separating it into preprocessing time (Prep) and runtime (RT); in doing so, we separately evaluate the cost of, e.g., parameter initialization from training.

Experimental Setup. We execute all algorithms on a machine with 16 dual-core CPUs (Intel Xeon E5-2630 v3 @ 2.40GHz), 32 GB of RAM, and an NVIDIA K80 GPU. For methods that can be parallelized either over the GPU or over the 16 CPUs (IMSAT, SN, k-means, KNet), we ran both executions and recorded the fastest time. The code provided for DEC could only be parallelized over the GPU, while AEC and SC could only be executed over the CPUs. As SGA is randomized, for each dataset in Table 4 we run all algorithms on the full dataset 10 times, and report the mean and standard deviation of the NMI of the resulting clustering against the ground truth.

For algorithms that can be executed out-of-sample (AEC, DEC, IMSAT, SN, KNet), we repeat the above experiment by training the embedding only on a random subset of the dataset. Subsequently, we apply the trained embedding to the entire dataset and cluster it via k-means. For comparison, we also find cluster centers via k-means on the subset and assign new samples to the nearest cluster center. For each dataset, we set the size of the subset (reported in Table 6) so that the spectrum of the resulting subset is close to the spectrum of the full dataset.

Dataset | Data % | AEC | DEC | IMSAT | SN | k-means | KNet (eig.) | KNet (Stiefel)
Moon 25% 51.05 45.6 45.5 100 21.5 100 100
Spiral1 10% 32.2 49.5 48.7 100 56.7 100 100
Cancer 30% 76.4 76.9 74.9 82.2 Fails 84.0 83.6
Wine 75% 49.1 81.5 69.4 77.2 25.0 91.1 89.3
RCV 6% 26.78 43.23 35.2 41.3 52.5 45.2 43.1
Face 35% 52.5 67.1 77.8 75.5 87.2 92.7 91.1
Table 6: Out-of-sample clustering results measured by NMI (as percentages); the best mean results are highlighted in bold. All algorithms are trained on a subset of the data; we report the results of clustering the full dataset out-of-sample via each algorithm.
Data Prep RT Prep RT Prep RT RT RT Prep RT RT
Moon 15.01s 22.34s 3.35m 2.95s 1.1s 27.63s 140.1s 0.034s 48s 2.3s 1.1s
Spiral1 3.2m 8.31m 4.17m 1.65m 47.05s 3.83m 223.3s 0.071s 82s 3.1s 1.4s
Cancer 55.3s 1.25m 5.1m 30.01s 1.65s 1.67m 38.9s 0.03s 16s 4s 1.8s
Wine 25.3s 35.31s 4.65m 32.31s 2.4s 25.6s 20.4s 0.03s 74s 1.3s 0.089s
RCV 63s 5.27m 5.65m 11.21m 3.21s 56s 40.67s 0.03s 1.2m 5.5s 2.1s
Face 15.1s 2.86m 5.25m 3.89m 1.1s 1.56m 35.09s 0.101s 76.8 2.2 0.96s
Table 7: The out-of-sample preprocessing (Prep) and runtime (RT) for all benchmark algorithms are displayed in seconds (s), minutes (m) and hours (h). The table demonstrates that KNet’s speed is comparable to competing methods.

6.1 Results

Selecting $\lambda$. As clustering is unsupervised, we cannot rely on ground-truth labels to identify the best hyperparameter $\lambda$. We therefore need an unsupervised method for selecting this value. We find that, in practice, selecting $\lambda = 0$ works quite well. Because the problem is not convex, local optima reached by KNet depend highly on the initialization. Initializing (a) $F_\theta$ to approximate the identity map via (11), and (b) $U$ to be the spectral embedding of $X$ indeed leads to a local maximum that is highly dependent on the input $X$, eschewing the need for the reconstruction error in the objective (7); this phenomenon is also observed by Li et al. (2017). We also found that, as anticipated by (6), eliminating the second term produces more convex cluster images.

Table 3 shows the NMI on the Wine dataset for different values of $\lambda$; tables for additional datasets can be found in Appendix A in the supplement. We also provide the HSIC and AE reconstruction error at convergence. Beyond the good performance of $\lambda = 0$, the table suggests an alternative unsupervised method: select $\lambda$ so that the ratio of the two terms at convergence is close to one. Of course, this comes at the cost of parameter exploration; in the remainder of this section, we report the performance of KNet with $\lambda$ set to 0.

Comparison Against State-of-the-Art. Table 4 shows the NMI performance of the different algorithms over the different datasets. With the exception of the RCV dataset, KNet outperforms every algorithm in Table 4. AEC, DEC, and IMSAT perform especially poorly on non-convex clusters, as shown in the first two rows of the table. Spectral clustering (SC) and SN, which is also based on a spectral-clustering-motivated objective, perform as well as KNet at discovering non-convex clusters. Nevertheless, KNet outperforms them on real datasets: e.g., for the Face dataset, KNet surpasses SN by 28%. Note that, for the RCV dataset, k-means outperformed all methods, though overall performance is quite poor; a reason for this may be the poor quality of the features extracted via TFIDF and PCA.

KNet's ability to handle non-convex clusters is evident in its improvement over k-means on the first two datasets. The kernel matrix shown in Fig. 1(b) illustrates why k-means performs poorly on such data. In contrast, the increasingly convex cluster images learned by KNet, as shown in Fig. 1(c), lead to much better separability. This is consistently observed for both the Moon and Spiral1 datasets, for which KNet achieves 100% NMI; we elaborate on this further in Appendix A, demonstrating KNet's ability to generate convex representations even when the initial representation is non-convex.

We also note that KNet consistently outperforms spectral clustering. This is worth noting because, as discussed in Sec. 4, KNet's initializations of both $\theta$ and $U$ are tied to the spectral embedding. Table 4 indicates that alternately learning both the kernel and the corresponding spectral embedding indeed leads to improved clustering performance.

Table 5 shows the time performance of each algorithm. In terms of total time, KNet is faster than AEC and DEC. We also observe that SN is faster than most algorithms in total time. However, SN is in practice the most time-intensive algorithm, since it requires extensive hyperparameter tuning to reach the reported NMI performance (see App. A). We note that a significant percentage of the total time for KNet is spent in the preprocessing step, i.e., the initialization of $\theta$ by training the corresponding autoencoder, with the Stiefel variant being faster than the eigendecomposition variant overall. Improving this initialization process could dramatically speed up the total runtime. Alternatively, as discussed in the next section, using only a small subset to train the embedding and clustering out-of-sample can also significantly accelerate the total runtime, without considerable NMI degradation.

Out-of-Sample Clustering. We report out-of-sample clustering NMI performance in Table 6; note that SC cannot be executed out-of-sample. Each algorithm is trained using only a subset of samples, whose size is indicated in the table's first column. Once trained, we report the clustering quality of applying each algorithm to the full set without retraining. We observe that, with the exception of RCV, KNet clearly outperforms all benchmark algorithms in terms of clustering quality. This implies that KNet is capable of generalizing using as little as 6% of the data.

Comparing Table 4 against Table 6, we see that AEC, DEC, and IMSAT suffer a significant quality reduction, while KNet suffers a maximum degradation of only 3%. Therefore, training on a small subset of the data not only yields high-quality results; the results are almost identical to those obtained by training on the full set. Table 7, which reports the corresponding times, indicates that this can also lead to a significant acceleration, especially of the preprocessing step. Together, these two observations indicate that KNet can indeed be applied to clustering of large non-convex datasets by training the embedding on only a small subset of the provided samples.
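The out-of-sample protocol above can be sketched in a few lines. The snippet below is an illustration only: the identity `embed` stands in for the trained KNet embedding network, and all names and data are hypothetical. We fit cluster centers on a small embedded subset, then assign the remaining samples without any retraining.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: two well-separated Gaussian clusters.
full = np.vstack([rng.normal(-3, 0.5, size=(500, 2)),
                  rng.normal(3, 0.5, size=(500, 2))])

def embed(x):
    # Stand-in for the trained KNet embedding network; identity here.
    return x

# Train on a small subset only (6% of the samples).
subset = full[rng.choice(len(full), size=60, replace=False)]
z = embed(subset)

# Lloyd's k-means iterations on the embedded subset (k = 2);
# initialize centers at the extreme points for determinism.
centers = z[[np.argmin(z[:, 0]), np.argmax(z[:, 0])]]
for _ in range(20):
    assign = np.argmin(((z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([z[assign == c].mean(axis=0) for c in range(2)])

# Out-of-sample step: embed the full dataset and assign each sample
# to its nearest center, with no retraining.
labels = np.argmin(((embed(full)[:, None] - centers[None]) ** 2).sum(-1), axis=1)
print(np.bincount(labels))  # roughly 500 samples per cluster
```

Because the clusters are well separated in the embedding space, the centers learned from 6% of the samples recover the full partition.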

7 Conclusions

KNet performs unsupervised kernel discovery using only a subset of the data. By discovering a kernel that optimizes the spectral clustering objective, it simultaneously learns an approximation of the corresponding spectral embedding through a DNN. Our experimental results further confirm that KNet can be trained using only a small subset of the data.


  • P.-A. Absil, R. Mahony, and R. Sepulchre (2008) Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ. Cited by: §5.
  • S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth (2000) The UCI KDD archive of large data sets for data mining research and experimentation. ACM SIGKDD Explorations Newsletter 2 (2), pp. 81–85. Cited by: §6.
  • X. H. Dang and J. Bailey (2010) Generation of alternative clusterings using the CAMI approach. In Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 118–129. Cited by: §6.
  • D. Dheeru and E. Karra Taniskidou (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §6.
  • Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz (2015) Yinyang k-means: a drop-in replacement of the classic k-means with consistent speedup. In International Conference on Machine Learning, pp. 579–587. Cited by: §6.
  • C. Fowlkes, S. Belongie, F. Chung, and J. Malik (2004) Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2), pp. 214–225. Cited by: §5.
  • A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf (2005) Measuring statistical dependence with hilbert-schmidt norms. In International Conference on Algorithmic Learning Theory, pp. 63–77. Cited by: §1, §3.
  • X. Guo, X. Liu, E. Zhu, and J. Yin (2017) Deep clustering with convolutional autoencoders. In International Conference on Neural Information Processing, pp. 373–382. Cited by: §2.
  • R. A. Horn, R. A. Horn, and C. R. Johnson (1990) Matrix analysis. Cambridge university press. Cited by: §5.
  • W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama (2017) Learning discrete representations via information maximizing self augmented training. arXiv preprint arXiv:1702.08720. Cited by: §2, §6.
  • P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid (2017) Deep subspace clustering network. In Advances in Neural Information Processing Systems, Cited by: §2.
  • C. Li, W. Chang, Y. Cheng, Y. Yang, and B. Póczos (2017) MMD GAN: towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pp. 2203–2213. Cited by: §6.1.
  • O. L. Mangasarian (1990) Cancer diagnosis via linear programming. SIAM News 23 (5), pp. 18. Cited by: §6.
  • A. Y. Ng, M. I. Jordan, and Y. Weiss (2002) On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856. Cited by: §6.
  • D. Niu, J. Dy, and M. Jordan (2011) Dimensionality reduction for spectral clustering. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 552–560. Cited by: Appendix A, §2, §3, §3, §6.
  • J. Ross and J. Dy (2013) Nonparametric mixture of gaussian processes with constraints. In International Conference on Machine Learning, pp. 1346–1354. Cited by: §6.
  • U. Shaham, K. Stanton, H. Li, R. Basri, B. Nadler, and Y. Kluger (2018) SpectralNet: spectral clustering using deep neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §2, §6.
  • C. Song, F. Liu, Y. Huang, L. Wang, and T. Tan (2013) Auto-encoder based data clustering. In Iberoamerican Congress on Pattern Recognition, pp. 117–124. Cited by: §2, §6.
  • L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo (2007) Supervised feature selection via dependence estimation. In Proceedings of the 24th international conference on Machine learning, pp. 823–830. Cited by: §3.
  • A. Strehl and J. Ghosh (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research 3 (Dec), pp. 583–617. Cited by: §6.
  • F. Tian, B. Gao, Q. Cui, E. Chen, and T. Liu (2014) Learning deep representations for graph clustering. In AAAI, pp. 1293–1299. Cited by: §2.
  • M. Vladymyrov and M. Carreira-Perpiñán (2016) The variational nystrom method for large-scale spectral problems. In International Conference on Machine Learning, pp. 211–220. Cited by: §5.
  • Z. Wen and W. Yin (2013) A feasible method for optimization with orthogonality constraints. Mathematical Programming 142 (1-2), pp. 397–434. Cited by: §5.
  • A. G. Wilson, Z. Hu, R. R. Salakhutdinov, and E. P. Xing (2016a) Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pp. 2586–2594. Cited by: §2.
  • A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016b) Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378. Cited by: §2.
  • A. G. Wilson, D. A. Knowles, and Z. Ghahramani (2011) Gaussian process regression networks. arXiv preprint arXiv:1110.4411. Cited by: §2.
  • W. H. Wolberg (1992) Wisconsin breast cancer dataset. University of Wisconsin Hospitals. Cited by: §6.
  • C. Wu, S. Ioannidis, M. Sznaier, X. Li, D. Kaeli, and J. Dy (2018) Iterative spectral method for alternative clustering. In International Conference on Artificial Intelligence and Statistics, pp. 115–123. Cited by: §3, §6.
  • J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487. Cited by: §2, §6.
  • D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf (2004) Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328. Cited by: §2.

Appendix A Relating HSIC to Spectral Clustering


Using Eq. (3) to compute HSIC empirically, Eq. (4) can be rewritten as

max_U  Tr(K_X H K_U H),

where K_X and K_U are the kernel matrices computed from X and U, and H is the centering matrix. As shown by Niu et al. (2011), if we let K_U be a linear kernel, such that K_U = U U^T, add the constraint U^T U = I, and rotate the trace terms, we get

max_U  Tr(U^T H K_X H U)   (18a)
s.t.:  U^T U = I.   (18b)

By setting the Laplacian as L = H K_X H, the formulation becomes identical to spectral clustering:

max_U  Tr(U^T L U)   (19a)
s.t.:  U^T U = I.   (19b)
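The trace rotation above can be checked numerically. The sketch below (all names and data are our own illustration) verifies that Tr(K_X H K_U H) equals Tr(U^T H K_X H U) when K_U = U U^T and U has orthonormal columns:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3

# Random data and an embedding U with orthonormal columns (U^T U = I).
X = rng.normal(size=(n, 5))
U, _ = np.linalg.qr(rng.normal(size=(n, k)))

# Gaussian kernel matrix K_X and the centering matrix H.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_X = np.exp(-sq / (2 * np.median(sq)))
H = np.eye(n) - np.ones((n, n)) / n

K_U = U @ U.T  # linear kernel on the embedding

lhs = np.trace(K_X @ H @ K_U @ H)
rhs = np.trace(U.T @ H @ K_X @ H @ U)
print(np.isclose(lhs, rhs))  # True: the two objectives coincide
```

The identity follows from the cyclic property of the trace, so it holds for any U; orthonormality only matters for the constraint (18b).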

Appendix B Effect of the trade-off parameter on HSIC, AE reconstruction error, and NMI

                KNet                         KNet
        HSIC    AE err.   NMI        HSIC    AE err.   NMI
        2.989   0.001     0.446      2.989   0         0.421
        2.989   0.0001    0.412      2.989   0         0.414
        2.989   0.0001    0.421      2.989   0         0.418
        2.989   0.001     0.434      2.989   0         0.419
        2.988   0.001     0.421      2.988   0         0.418
        2.965   0.099     0.557      2.979   0.033     0.558
        1.982   0.652     0.644      2.33    0.583     0.56
        2.124   0.669     0.969      2.037   0.66      0.824
        2.249   0.69      1          2.368   0.67      1
        2.278   0.715     1          2.121   0.718     1
Table 8: HSIC, AE reconstruction error, and NMI at convergence of the two KNet variants on the SPIRAL1 dataset, as a function of the trade-off parameter; the last row corresponds to a parameter value of 0.

Appendix C Normalized Mutual Information

Consider two clustering assignments, each assigning labels in {1, …, k} to the samples of a dataset of size n. We represent these two assignments through two partitions of the dataset, namely {A_1, …, A_k} and {B_1, …, B_k}, where A_a and B_b are the sets of samples receiving label a and label b under each assignment, respectively. Define the empirical distribution of labels to be:

p(a, b) = |A_a ∩ B_b| / n,  for a, b ∈ {1, …, k}.   (20)

The NMI is then given by the ratio

NMI = I(A; B) / sqrt(H(A) H(B)),

where I(A; B) is the mutual information and H(A), H(B) are the entropies of the marginals of the joint distribution (20).
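As a concrete sketch, the function below (our own illustration, using the geometric-mean normalization of Strehl and Ghosh (2002)) computes NMI directly from the empirical joint distribution above:

```python
import numpy as np

def nmi(a, b):
    """NMI of two label vectors via the empirical joint distribution (20)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    labels_a, labels_b = np.unique(a), np.unique(b)
    # Empirical joint distribution p(i, j) = |A_i ∩ B_j| / n.
    p = np.array([[np.sum((a == la) & (b == lb)) for lb in labels_b]
                  for la in labels_a]) / n
    pa, pb = p.sum(axis=1), p.sum(axis=0)  # marginals (all entries > 0)
    # Mutual information over nonzero joint entries (0 log 0 := 0).
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    # Note: undefined when either partition has a single cluster (H = 0).
    return mi / np.sqrt(ha * hb)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical up to relabeling: ≈ 1.0
```

NMI is invariant to permutations of the label names, which is why it is the standard quality measure for comparing a clustering against ground-truth classes.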

Appendix D Moon and Spiral datasets

In this appendix, we illustrate that for the Moon and the Spiral datasets, we are able to learn (a) convex images for the clusters and (b) kernels that produce block-diagonal structures. The kernel matrices are constructed with a Gaussian kernel, so their values lie between 0 and 1. In the figures below, white denotes 0 and dark blue denotes 1; all values in between are shown as a gradient between the two colors.
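To make the block-diagonal pattern concrete, the following sketch (with illustrative synthetic data of our own) builds a Gaussian kernel matrix over samples ordered by cluster; within-cluster entries form dense diagonal blocks while cross-cluster entries are near zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated clusters, with samples ordered by cluster label.
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
               rng.normal(5, 0.3, size=(30, 2))])

# Gaussian kernel matrix; entries lie in (0, 1], with 1 on the diagonal.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

# Within-cluster similarities dominate cross-cluster ones.
within = K[:30, :30].mean()
across = K[:30, 30:].mean()
print(within > across)  # True: block-diagonal structure
```

Plotting K as an image (white for 0, dark blue for 1) would reproduce the two-block pattern shown in the figures below.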

In Figure 4, the Moon dataset is plotted in Fig. 4(a) and its kernel block structure in Fig. 4(b). After training the embedding network, the image of the data is shown in Fig. 4(c), along with its block-diagonal structure in Fig. 4(d). Using the same trained network, we distort the data with Gaussian noise and plot it in Fig. 5(a), along with its kernel matrix in Fig. 5(b). We then pass the distorted data through the network and plot the resulting image in Fig. 5(c), along with its kernel matrix in Fig. 5(d). This example demonstrates KNet’s ability to embed data into convex clusters even under Gaussian noise.

In Figure 6, a subset of 300 samples of the Spiral dataset is plotted in Fig. 6(a) and its kernel block structure in Fig. 6(b). After training the embedding network, the image of the subset is shown in Fig. 6(c), along with its block-diagonal structure in Fig. 6(d). The full dataset is shown in Fig. 7(a), along with its kernel matrix in Fig. 7(b). Using the same network trained on the subset from Fig. 6(a), we pass the full dataset through it and plot the resulting image in Fig. 7(c), along with its kernel matrix in Fig. 7(d). This example demonstrates KNet’s ability to generalize the convex embedding using only 1% of the data.

Figure 4: The effect of training the embedding network on the Moon dataset. The original data and its kernel matrix are shown in (a) and (b), respectively; the embedding of the data and its kernel matrix are shown in (c) and (d), respectively. This figure demonstrates KNet’s ability to map non-convex clusters into a convex representation.
Figure 5: The effect of applying a trained embedding network to a distorted Moon dataset. The distorted data and its kernel matrix are shown in (a) and (b), respectively; the embedding of the distorted data and its kernel matrix are shown in (c) and (d), respectively. This figure demonstrates KNet’s robustness in mapping non-convex clusters into a convex representation under Gaussian distortion.
Figure 6: The effect of training and applying the embedding network on a small subset of the Spiral dataset. The subset and its kernel matrix are shown in (a) and (b), respectively; the embedding of the subset and its kernel matrix are shown in (c) and (d), respectively. This figure demonstrates KNet’s ability to map non-convex clusters into a convex representation using only a small subset of the data.
Figure 7: The effect of applying an embedding network trained on a small subset (in this case, 1% of the data) to the full Spiral dataset. The full dataset and its kernel matrix are shown in (a) and (b), respectively; the embedding of the full dataset and its kernel matrix are shown in (c) and (d), respectively. This figure demonstrates KNet’s ability to generate a convex representation at scale while training on only a small subset of the data.

Appendix E Algorithm Hyperparameter Details

Algorithm Learning Rate Batch size Optimizer
AEC - 100 SGD
DEC 0.01 256 SGDSolver
IMSAT 0.002 250 Adam
SN 0.001 128 RMSProp
KNet 0.001 5 Adam
Table 9: Typical learning rates, batch sizes, and optimizer types for each algorithm. These were set as recommended by the respective papers, except for AEC, whose paper does not specify a learning rate; the available implementation sets it via a line search. We use the above settings in general, changing them only if the batch size is too large for a dataset or the preset learning rate does not lead to convergence.

The hyperparameters for each network are set as outlined in the respective papers. In the case of SN, the hyperparameters included the number of neighbors used in the scale calculation, the number of neighbors used for the graph Laplacian affinity matrix, the number of neighbors used to compute the scale of the Gaussian graph Laplacian, and the threshold for selecting the closest neighbors in the Siamese network. They were set by performing a grid search over values ranging from 2 to 10, using the loss over a portion of the training data.