1 Introduction
Clustering algorithms group similar samples together based on some predefined notion of similarity, often defined through kernels. However, the choice of an appropriate kernel is data-dependent; as a result, kernel design is frequently an art that requires intimate knowledge of the data. A common alternative is to simply use a general-purpose kernel that performs well under various conditions (e.g., polynomial or Gaussian kernels).
In this paper, we propose KernelNet (KNet), a methodology for learning a kernel, as well as an induced clustering, directly from the observed data. In particular, we train a deep kernel, combining a neural network representation with a Gaussian kernel. More specifically, given a dataset of n samples in R^d, we learn a kernel of the form:

\hat{k}(x_i, x_j) = \frac{1}{\sqrt{d_i d_j}} \exp\left( -\frac{\| f_\theta(x_i) - f_\theta(x_j) \|^2}{2\sigma^2} \right),

where f_\theta is an embedding function modeled as a neural network parametrized by \theta, and d_i, d_j are normalizing constants. Intuitively, by incorporating a neural network (NN) parameterization into a Gaussian kernel, we are able to learn a flexible deep kernel for clustering, tailored specifically to the given dataset.
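To make the construction concrete, here is a minimal NumPy sketch of such a deep kernel; the embedding f is a placeholder callable standing in for the trained network, and the width sigma and the toy data are arbitrary choices, not values from the paper:

```python
import numpy as np

def deep_kernel(X, f, sigma=1.0):
    """Normalized Gaussian kernel over the images f(x_i).

    f is any embedding function (in KNet, a trained neural network);
    here it is a plain callable, so this is only an illustrative sketch.
    """
    Z = f(X)                                       # embed all samples
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))             # Gaussian kernel on the images
    d = K.sum(axis=1)                              # per-sample degrees
    return K / np.sqrt(np.outer(d, d))             # D^{-1/2} K D^{-1/2}

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K_bar = deep_kernel(X, f=np.tanh)                  # toy nonlinear embedding
assert K_bar.shape == (5, 5)
assert np.allclose(K_bar, K_bar.T) and (K_bar > 0).all()
```

The degree normalization keeps the learned similarity matrix well scaled regardless of where the embedding places the samples.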
We train our deep kernel with a spectral clustering objective based on the Hilbert-Schmidt Independence Criterion (Gretton et al., 2005). Our training process can be interpreted as simultaneously learning a nonlinear transformation as well as its spectral embedding. By appropriately selecting the initial values of our training process, we ensure that our clustering method is at least as powerful as spectral clustering. In particular, just as spectral clustering, our learned kernel and the induced clustering work exceptionally well on nonconvex clusters. In practice, by training the kernel directly from the data, our proposed method significantly outperforms spectral clustering.

The nonlinear embedding learned directly from the dataset allows us to readily handle out-of-sample data. Given a new sample x, not observed before, we can easily identify its cluster label by first computing its image f_\theta(x), effectively embedding it to the same space as the (already clustered) existing dataset. This is in contrast to spectral clustering, which would require a re-execution of the algorithm from scratch on the combined dataset of samples.
The aforementioned properties of our algorithm are illustrated in Fig. 1. A dataset of samples in R^2 with nonconvex spiral clusters is shown in Fig. 1(a). Applying a Gaussian kernel directly to these samples leads to a highly uninformative similarity matrix, as shown in Fig. 1(b). We train our embedding on only 1% of the samples, and apply it to the entire dataset; the dataset image, shown in Fig. 1(c), consists of nearly-convex, linearly-separable clusters. Most importantly, the corresponding learned kernel, shown in Fig. 1(d), yields a highly informative similarity matrix that clearly exhibits a block-diagonal structure: this indicates high similarity within each cluster and low similarity across clusters.
In summary, we make the following contributions:

We propose a novel methodology for discovering a deep kernel tailored for clustering directly from data, using an objective based on the Hilbert-Schmidt Independence Criterion.

We propose an algorithm for training the kernel by maximizing this objective, as well as for selecting a good parameter initialization. Our algorithm, KNet, can be perceived as alternating between training the kernel and discovering its spectral embedding.

We evaluate the performance of KNet on synthetic and real data against multiple state-of-the-art methods for deep clustering. In 5 out of 6 datasets, KNet outperforms the state of the art by as much as 57.8%; this discrepancy is more pronounced in datasets with nonconvex clusters, which KNet handles very well.

Finally, we demonstrate that the algorithm has exceptional performance in clustering out-of-sample data. We exploit this to show that KNet can be significantly accelerated through subsampling: an embedding learned on only 1%-35% of the data can be used to cluster the entire dataset, with only a 0%-3% degradation of clustering performance.
The remainder of the paper is organized as follows. In Sec. 2, we discuss related work. In Sec. 3, we provide a brief description of the Hilbert-Schmidt Independence Criterion and its relationship to clustering. The formal definition of the problem we solve and our proposed algorithm (KNet) are discussed in Sec. 4 and Sec. 5, respectively. Sec. 6 contains our experiments. Finally, we conclude in Sec. 7.
2 Related Work
Several recent works propose autoencoders specifically designed for clustering.
Song et al. (2013) combine an autoencoder with k-means, adding a penalty on the distance to cluster centers, and optimize by alternating between stochastic gradient descent (SGD) and cluster center assignment.
Ji et al. (2017) incorporate a subspace clustering penalty into an autoencoder, and alternate between SGD and dictionary learning. Tian et al. (2014) learn a stacked autoencoder initialized via a similarity matrix. Xie et al. (2016) incorporate a KL-divergence penalty between the encoding and a soft cluster assignment, both of which are again alternately optimized; a similar approach is followed by Guo et al. (2017) and Hu et al. (2017). In KNet, we significantly depart from these methods by using an HSIC-based objective, motivated by spectral clustering. In practice, this makes KNet better tailored to learning nonconvex clusters, on which the aforementioned techniques perform poorly: we demonstrate this experimentally in Section 6.

Our work is closest to Shaham et al. (2018), who use an objective motivated by spectral clustering to learn an embedding highly correlated with a given similarity matrix; this makes the embedding only as good as the chosen similarity. The resulting optimization corresponds to Eq. (6) in our paper: KNet can thus be seen as a generalization of Shaham et al. (2018) that learns both the embedding and the similarity matrix jointly; this eschews the need for discovering the right similarity tailored to the dataset, and leads to overall improved performance (see also Sec. 6).
KNet also has a direct relationship to methods for kernel learning. A series of papers (Wilson et al., 2011, 2016b, 2016a) regress deep kernels to model Gaussian processes. Zhou et al. (2004) learn (shallow) linear combinations of given kernels. Closest to us, Niu et al. (2011) use HSIC to jointly discover a subspace in which the data lives as well as its spectral embedding; the latter is used to cluster the data. This corresponds to learning a kernel over a (shallow) projection of the data to a linear subspace. KNet, therefore, generalizes the work of Niu et al. (2011) to learning a deep, nonlinear kernel representation (cf. Eq. (9)), which improves upon spectral embeddings and is used directly to cluster the data.
3 HilbertSchmidt Independence Criterion
Proposed by Gretton et al. (2005), the Hilbert-Schmidt Independence Criterion (HSIC) is a statistical dependence measure between two random variables. Like Mutual Information (MI), it measures dependence via the distance between the joint distribution of the two random variables and the product of their individual distributions. However, compared to MI, HSIC is easier to compute empirically, since it does not require a direct estimation of the joint distribution. Due to this advantage, it is used in many applications, including dimensionality reduction (Niu et al., 2011; Song et al., 2007) and alternative clustering (Wu et al., 2018), to name a few.

Formally, consider a set of n i.i.d. samples (x_i, y_i), i = 1, \ldots, n, where x_i \in R^d, y_i \in R^k are drawn from a joint distribution. Let X \in R^{n \times d} and Y \in R^{n \times k} be the corresponding matrices comprising a sample in each row. Let also k_G be the Gaussian kernel k_G(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2) and k_L the linear kernel k_L(y, y') = y^\top y'. Define K_X, K_Y \in R^{n \times n} to be the kernel matrices with entries K_X(i, j) = k_G(x_i, x_j) and K_Y(i, j) = k_L(y_i, y_j), respectively, and let \bar{K}_X be the normalized Gaussian kernel matrix given by
\bar{K}_X = D^{-1/2} K_X D^{-1/2},   (1)

where the degree matrix

D = diag(d_1, \ldots, d_n), with d_i = \sum_{j=1}^n K_X(i, j),   (2)

is a normalizing diagonal matrix. Then, the HSIC between X and Y is estimated empirically via:

HSIC(X, Y) = \frac{1}{(n-1)^2} tr(\bar{K}_X H K_Y H),   (3)

where H = I - \frac{1}{n} 1_n 1_n^\top is a centering matrix.
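A direct NumPy transcription of the estimator in Eqs. (1)-(3) is sketched below; the kernel width sigma and the toy data are arbitrary choices, not values from the paper:

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC of Eq. (3): a normalized Gaussian kernel on X,
    a linear kernel on Y, both centered by H = I - (1/n) 1 1^T."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    d = K.sum(axis=1)
    K_bar = K / np.sqrt(np.outer(d, d))            # Eq. (1)
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    return np.trace(K_bar @ H @ (Y @ Y.T) @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
# tr(Kbar H K_Y H) = tr(Kbar (HY)(HY)^T) >= 0, since both factors are PSD
assert hsic(X, X) > 0
assert hsic(X, rng.normal(size=(40, 2))) >= -1e-9
```

In practice the estimate is markedly larger for dependent pairs (e.g., Y a function of X) than for independent draws, which is exactly the property the clustering objective exploits.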
Intuitively, HSIC empirically measures the dependence between samples of the two random variables. Though HSIC can be more generally defined for arbitrary kernels, this particular choice has a direct relationship with (and motivation from) spectral clustering. In particular, given X, consider the optimization:

max_{U \in R^{n \times k}}  HSIC(X, U)   (4a)
subject to  U^\top U = I,   (4b)

where HSIC is given by (3). Then, the optimal solution U to (4) is precisely the spectral embedding of X (Niu et al., 2011). Indeed, U comprises the top k eigenvectors of the Laplacian of X, given by:

L = H \bar{K}_X H = H D^{-1/2} K_X D^{-1/2} H.   (5)
For completeness, we prove this in Appendix A.
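The claim can also be checked numerically: the top-k eigenvectors of L attain a trace objective at least as large as any other orthonormal U (a consequence of the Ky Fan theorem). A small NumPy sketch with arbitrary toy data:

```python
import numpy as np

# Build L = H Kbar H for toy data (Gaussian kernel, sigma = 1), then check
# that the top-k eigenvectors maximize tr(U^T L U) over U^T U = I.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)
d = K.sum(axis=1)
K_bar = K / np.sqrt(np.outer(d, d))
H = np.eye(30) - np.ones((30, 30)) / 30
L = H @ K_bar @ H                       # Laplacian of Eq. (5)

k = 3
w, V = np.linalg.eigh(L)                # eigenvalues in ascending order
U_star = V[:, -k:]                      # top-k eigenvectors
best = np.trace(U_star.T @ L @ U_star)

Q, _ = np.linalg.qr(rng.normal(size=(30, k)))   # some other feasible U
assert np.trace(Q.T @ L @ Q) <= best + 1e-9
assert np.allclose(U_star.T @ U_star, np.eye(k))
```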
Table 1: Notation.

\hat{k}  the kernel function discovered by KNet
n  number of samples in a dataset
k  number of clusters
d  sample dimension
x, y  data samples in R^d, R^k
X, Y  data matrices in R^{n \times d}, R^{n \times k}, where each row represents a single sample
k_G  Gaussian kernel
k_L  linear kernel
\sigma  Gaussian kernel parameter
K_X, K_Y  kernel matrices computed on X, Y
\bar{K}_X  normalized Gaussian kernel matrix, given by (1)
D  degree matrix, given by (2)
HSIC(X, Y)  empirical estimate of HSIC, given by (3)
H  centering matrix
1_n  a vector of length n containing 1s in all entries
L  the Laplacian of spectral clustering, given by (5)
U  spectral embedding of X, obtained via eigendecomposition of L or as a solution to (4)
\theta_e  the weights of the encoder
\theta_d  the weights of the decoder
f_\theta(x)  embedding for a single sample
f_\theta(X)  embedding for the entire dataset
\psi  decoder portion of the autoencoder
\phi  autoencoder, given by (8)
U  spectral embedding of f_\theta(X) (optimization variable in (7))
\theta  pair (\theta_e and \theta_d)
J(\theta, U)  objective of optimization problem (7)
L_{f_\theta(X)}  the Laplacian of f_\theta(X), given by (13)
d_i  i-th diagonal element of degree matrix D
4 Problem Formulation
We are given a dataset of samples grouped in k (potentially) nonconvex clusters. Our objective is to cluster the samples by first embedding them into a space in which the clusters become convex. Given such an embedding, clusters can subsequently be identified via, e.g., k-means. We would like the embedding, modeled as a neural network, to be at least as expressive as spectral clustering: clusters separable by spectral clustering should also become separable via our embedding. In addition, the embedding should generalize to out-of-sample data, thereby enabling us to cluster new samples outside the original dataset.
Learning the Embedding and a Deep Kernel. Formally, we wish to identify k clusters over a dataset X \in R^{n \times d} of n samples and d features. Let f_\theta be an embedding function, modeled as a DNN parametrized by \theta; we denote by f_\theta(x) the image of a sample x under parameters \theta, and by f_\theta(X) the induced image of the entire dataset.

Let U \in R^{n \times k} be the spectral embedding of X, obtained via spectral clustering. We can train f_\theta to induce similar behavior as U via the following optimization:

max_\theta  HSIC(f_\theta(X), U),   (6)

where HSIC is given by Eq. (3). Intuitively, seeing HSIC as a dependence measure, by training \theta so that f_\theta(X) is maximally dependent on U, f_\theta(X) becomes a surrogate to the spectral embedding, sharing similar properties.
However, the surrogate learned via (6) may only be as discriminative as U. To address this issue, we depart from (6) by jointly discovering both f_\theta as well as a coupled spectral embedding U. In particular, we solve the following optimization problem w.r.t. both the embedding parameters \theta and U:

max_{\theta, U}  HSIC(f_\theta(X), U) - \lambda \| X - \phi_\theta(X) \|^2   (7a)
subj. to:  U^\top U = I,   (7b)
U \in R^{n \times k},   (7c)

where the unknowns are \theta = (\theta_e, \theta_d) and U, and

\phi_\theta(x) = \psi_{\theta_d}(f_{\theta_e}(x))   (8)
is an autoencoder, comprising f_{\theta_e} and \psi_{\theta_d} as an encoder and decoder, respectively. The autoencoder ensures that the embedding f_\theta(X) is representative of the input X. To gain some intuition into how problem (7) generalizes (6), observe that if the embedding is fixed to be the identity map (i.e., f_\theta(x) = x for all x) then, by Eq. (4), optimizing only for U produces the spectral embedding of X. The joint optimization of both \theta and U allows us to further improve upon f_\theta, as well as upon the coupled U; we demonstrate experimentally in Section 6 that this significantly improves clustering quality.
The optimization (7) can also be interpreted as an instance of kernel learning. Indeed, as discussed in the introduction, by learning f_\theta, we discover in effect a normalized kernel of the form

\hat{k}(x_i, x_j) = \frac{1}{\sqrt{d_i d_j}} \exp\left( -\frac{\| f_\theta(x_i) - f_\theta(x_j) \|^2}{2\sigma^2} \right),   (9)

where d_i, d_j are the corresponding diagonal elements of the degree matrix D, computed over the images f_\theta(X).
Out-of-Sample Data. The embedding f_\theta can readily be applied to clustering out-of-sample data. In particular, having trained f_\theta over dataset X, given a new dataset X', we can cluster this new dataset efficiently as follows. First, we use the pretrained f_\theta to map every sample in X' to its image, producing f_\theta(X'): this effectively embeds X' to the same space as f_\theta(X). From this point on, clusters can be recomputed efficiently via, e.g., k-means, or by mapping the images to the closest existing cluster head. In contrast to, e.g., spectral clustering, this avoids recomputing the joint embedding of the entire dataset from scratch.
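A minimal sketch of this out-of-sample step, with a stand-in embedding and hypothetical cluster centers in place of the trained KNet components:

```python
import numpy as np

def assign_out_of_sample(X_new, f, centers):
    """Cluster unseen samples with a trained embedding f: map each new
    sample to its image and pick the nearest existing cluster center
    (centers live in the embedding space, e.g. from k-means on f(X)).
    Both f and centers are placeholders for the trained components."""
    Z = f(X_new)
    dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

f = np.tanh                                    # stand-in for the trained encoder
centers = np.array([[-0.9, -0.9], [0.9, 0.9]]) # hypothetical cluster heads
labels = assign_out_of_sample(np.array([[-2.0, -2.0], [2.0, 2.0]]), f, centers)
assert labels.tolist() == [0, 1]
```

Note that no retraining happens here: the only per-sample cost is one forward pass through the encoder plus a nearest-center lookup.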
The ability to handle out-of-sample data can be leveraged to also accelerate training. In particular, given the original dataset X, computation can be sped up by training the embedding, by solving (7), on only a small subset of X. The resulting trained f_\theta can be used to embed, and subsequently cluster, the entire dataset. We show in Section 6 that this approach works very well, leading to a significant acceleration of computations without degrading clustering quality.
Convex Cluster Images. The first term in objective (7a) naturally encourages f_\theta(X) to form convex clusters. To see this, observe that, ignoring the reconstruction error, the objective (7a) becomes:

max_\theta  \sum_{i,j} \frac{G_{ij}}{\sqrt{d_i d_j}} \exp\left( -\frac{\| f_\theta(x_i) - f_\theta(x_j) \|^2}{2\sigma^2} \right),   (10)

where G_{ij} are the elements of the matrix G = H U U^\top H. The exponential terms in Eq. (10) compel samples for which G_{ij} > 0 to become attracted to each other, while samples for which G_{ij} < 0 drift farther apart. This is illustrated in Figure 2: linearly separable, increasingly convex cluster images arise over several iterations of our algorithm for solving (7). The algorithm, KNet, is described in the next section.
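The sign pattern of these weights can be verified on a toy example: for an indicator-style embedding U of two clusters, G = H U U^T H is positive within clusters and negative across them. A small NumPy sketch:

```python
import numpy as np

# Weights G = H U U^T H of Eq. (10) for a toy 2-cluster embedding:
# same-cluster pairs get G_ij > 0 (attraction), cross-cluster pairs
# get G_ij < 0 (repulsion).
n, k = 6, 2
H = np.eye(n) - np.ones((n, n)) / n
U = np.zeros((n, k))
U[:3, 0] = 1 / np.sqrt(3)      # indicator-style embedding of cluster 1
U[3:, 1] = 1 / np.sqrt(3)      # cluster 2
G = H @ U @ U.T @ H
assert G[0, 1] > 0             # same cluster: attracted
assert G[0, 4] < 0             # different clusters: repelled
```

Working the centering out by hand, the same-cluster entries equal 1/6 and the cross-cluster entries equal -1/6 in this example, matching the attraction/repulsion reading of Eq. (10).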
5 KNet Algorithm
We solve optimization problem (7) by iteratively adapting \theta and U. In particular, we initialize f_\theta to be the identity map, and U to be the spectral embedding of X. We subsequently alternate between adapting \theta and U. We adapt \theta via stochastic gradient ascent (SGA). To optimize U, we adopt two approaches: one based on eigendecomposition, and one based on optimization over the Stiefel manifold. We describe each of these steps in detail below; a summary can be found in Algorithm 1.
Initialization. The nonconvexity of (7) necessitates a principled approach for selecting good initialization points for \theta and U. We initialize U to the spectral embedding of X, computed via the top k eigenvectors of the Laplacian of X, given by (5). We initialize \theta so that f_\theta is the identity map; this is accomplished by pretraining \theta_e, \theta_d via SGD as solutions to:

min_{\theta_e, \theta_d}  \| X - \phi_\theta(X) \|^2.   (11)

Note that, in this construction, the output of f_\theta has the same dimension as its input, so that f_\theta can approximate the identity map.
Updating \theta. A simple way to update \theta is via gradient ascent, i.e.:

\theta^{t+1} = \theta^t + \eta \nabla_\theta J(\theta^t, U),   (12)

for t = 0, 1, \ldots, where J is the objective (7a). In practice, we wish to apply stochastic gradient ascent over mini-batches; for U fixed, objective (7a) reduces to (10); however, the terms in the sum are coupled via the normalizing degree matrix D, which depends on \theta via (2). This significantly increases the cost of computing mini-batch gradients. To simplify this computation, we instead hold both U and D fixed, and update \theta via one epoch of SGA over (7a). At the conclusion of each epoch, we update the Gaussian kernel and the corresponding degree matrix via Eq. (2). We implemented both this heuristic and regular SGA, and found that the heuristic led to a significant speedup without any observable degradation in clustering performance (see also Section 6).

Updating U via Eigendecomposition. Our first approach to adapting U relies on the fact that, holding \theta constant, problem (7) reduces to the form (4). That is, at each iteration, for \theta fixed, the optimal solution is given by the top k eigenvectors of the Laplacian

L_{f_\theta(X)} = H D^{-1/2} K_{f_\theta(X)} D^{-1/2} H.   (13)

Hence, given \theta at an iteration, we update U by setting it to these top k eigenvectors. When n is large, there are several efficient algorithms for computing the top k eigenvectors; see, e.g., Fowlkes et al. (2004); Vladymyrov and Carreira-Perpiñán (2016).
Updating U via Stiefel Manifold Ascent. The manifold in R^{n \times k} defined by constraint (7b), a.k.a. the Stiefel manifold, is not convex; nevertheless, there is an extensive literature on optimization techniques over this set (Absil et al., 2008; Wen and Yin, 2013), exploiting the fact that ascent directions that maintain feasibility can be computed efficiently. In particular, following Wen and Yin (2013), treating \theta as a constant, and given a feasible U and the gradient G = \nabla_U J of the objective w.r.t. U, define

A = G U^\top - U G^\top.   (14)

Using A and a predefined step length \tau, the maximization proceeds iteratively via:

U^{t+1} = Q(\tau) U^t,   (15)

where Q(\tau) is the so-called Cayley transform, defined as

Q(\tau) = \left( I + \frac{\tau}{2} A \right)^{-1} \left( I - \frac{\tau}{2} A \right).   (16)

The Cayley transform satisfies several important properties (Wen and Yin, 2013). First, starting from a feasible point, it maintains feasibility over the Stiefel manifold (7b) for all \tau. Second, for small enough \tau, it is guaranteed to follow an ascent direction; combined with line search, convergence to a stationary point is guaranteed. Finally, Q(\tau) given by (16) can be computed efficiently from A, avoiding a full matrix inversion, by using the Sherman-Morrison-Woodbury identity (Horn et al., 1990): this results in an O(nk^2) complexity for (15), which is significantly faster than eigendecomposition when k << n. In our second approach to updating U, we apply (15) rather than the eigendecomposition of L_{f_\theta(X)} when adapting U iteratively. Both approaches are summarized in line 7 of Alg. 1; we refer to them as KNet_E and KNet_S, respectively, in our experiments in Sec. 6.
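A NumPy sketch of one Cayley update, Eqs. (14)-(16), using a full matrix inverse for simplicity (the Sherman-Morrison-Woodbury speedup is omitted); the gradient is a random stand-in:

```python
import numpy as np

def cayley_step(U, G, tau):
    """One Stiefel-manifold ascent step: form the skew-symmetric matrix
    A = G U^T - U G^T and update U <- (I + tau/2 A)^{-1} (I - tau/2 A) U.
    Sketch only: a full n x n solve, not the SMW-accelerated version."""
    n = U.shape[0]
    A = G @ U.T - U @ G.T                           # Eq. (14), skew-symmetric
    I = np.eye(n)
    return np.linalg.solve(I + (tau / 2) * A, (I - (tau / 2) * A) @ U)

rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.normal(size=(8, 3)))        # feasible starting point
G = rng.normal(size=(8, 3))                         # stand-in gradient
U_next = cayley_step(U, G, tau=0.1)
assert np.allclose(U_next.T @ U_next, np.eye(3))    # stays on the Stiefel manifold
```

Because A is skew-symmetric, the transform Q(tau) is orthogonal, so orthonormality of the columns of U is preserved exactly (up to floating-point error) at every step.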
Dataset  n  k  d  Type  \sigma

Moon  1000  2  2  Geometric Shape  0.1701
Spiral1  3000  3  2  Geometric Shape  0.1708
Spiral2  30000  3  2  Geometric Shape  0.1811
Cancer  683  2  9  Medical  3.3194
Wine  178  3  13  Classification  4.939
RCV  10000  4  5  Text  2.364
Face  624  20  27  Image  6.883
\lambda  KNet_E  KNet_S

  HSIC  AE error  NMI  HSIC  AE error  NMI
98.11  21.582  0.9  92.01  8.98  0.89  
98.805  72.952  0.88  94.211  72.619  0.87  
112.321  101.449  0.87  101.11  88.393  0.89  
0.005  109.008  105.369  0.908  108.222  107.221  0.903 
113.178  124.56  0.91  110.182  126.63  0.91  
0  110.895  127.23  0.924  109.363  127.11  0.908 
6 Experimental Evaluation
Datasets. The datasets we use are summarized in Table 2. The first three datasets (Moon, Spiral1, Spiral2) are synthetic and comprise nonconvex clusters; they are shown in Figure 3. Among the remaining four real-life datasets, the features of Breast Cancer (Wolberg, 1992; Mangasarian, 1990) are discrete integer values between 0 and 10. The features of the Wine dataset (Dheeru and Karra Taniskidou, 2017) consist of a mix of real and integer values. The Reuters dataset (RCV) is a collection of news articles labeled by topic. We represent each article via a TF-IDF vector using the 500 most frequent words and apply PCA to further reduce the dimension to 5. The Face dataset (Bay et al., 2000) consists of grey-scale images of 20 faces in different orientations; we reduce the dimension to 27 via PCA. As a final preprocessing step, we center and scale all datasets so that their mean is 0 and the standard deviation of each feature is 1.
Dataset  AEC  DEC  IMSAT  SN  SC  k-means  KNet_E  KNet_S

Moon  56.2 ± 0.0  42.2 ± 0.0  51.3 ± 20.3  100 ± 0.0  72 ± 0.0  66.1 ± 0.0  100 ± 0.0  100 ± 0.0
Spiral1  28.3 ± 0.0  32.02 ± 0.01  59.6 ± 7.5  100.0 ± 0.0  100.0 ± 0.0  42.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0
Cancer  79.9 ± 0.2  79.2 ± 0.0  74.6 ± 2.2  82.9 ± 0.0  69.8 ± 0.0  73 ± 0.0  84.2 ± 0.4  82.5 ± 0.1
Wine  54.6 ± 0.01  80.6 ± 0.0  72.3 ± 11.4  79.7 ± 0.2  88 ± 0.0  42.8 ± 0.0  91.0 ± 0.8  90.0 ± 0.7
RCV  39.3 ± 0.0  51.3 ± 0.0  39.0 ± 5.5  43.5 ± 0.2  46 ± 0.0  56 ± 0  46.3 ± 0.4  46.1 ± 0.2
Face  76.8 ± 0.0  75.8 ± 1.6  83.8 ± 3.5  75.6 ± 0.1  66.0 ± 0.4  91.8 ± 0.0  93 ± 0.3  92.6 ± 0.5
AEC  DEC  IMSAT  SN  SC  k-means  KNet

Dataset  Prep  RT  Prep  RT  Prep  RT  RT  RT  RT  Prep  RT  RT 
Moon  18.23s  28.04s  5.53m  3.52s  1.2s  34.34s  5.4m  0.34s  0.043s  129s  28.9s  18.4s 
Spiral1  42.9m  2.2 h  5.73m  2.02m  53.06s  5.15m  7.4m  4.2s  0.12s  5.0m  42s  19s 
Cancer  1.65m  4.67m  5.7m  38.73s  1.1s  2.4m  3.9m  0.18s  0.03s  150s  19.3s  10.3s 
Wine  33.81s  41.69s  5.75m  38.02s  1.1s  33.32s  5.5m  0.03s  0.06s  7.7m  7.2s  3.4s 
RCV  2.36m  42.28m  6.35m  13.07m  10.39s  1.22m  7.3m  83.4s  0.35s  18m  30.5m  18.6m 
Face  27.5s  3.15m  5.74m  4.235m  1.2s  2.03m  170.9s  0.26s  0.15s  22m  20.9s  3.3s 
Clustering Algorithms. We evaluate 8 algorithms, including our two versions of KNet described in Alg. 1. For existing algorithms, we use architecture designs (e.g., depth, width) as recommended by the respective authors during training. We provide details for each algorithm below.
k-means: We use the CUDA implementation by Ding et al. (2015) (https://github.com/src-d/kmcuda).
SC: We use the Python scikit-learn implementation of the classic spectral clustering algorithm by Ng et al. (2002).
AEC: Proposed by Song et al. (2013), this algorithm incorporates a k-means objective in an autoencoder. As suggested by the authors, we use 3 hidden layers of width 1000, 250, and 50, respectively, with an output layer dimension of 10 (https://github.com/developfeng/DeepClustering).
DEC: Proposed by Xie et al. (2016), this algorithm couples an autoencoder with a soft cluster assignment via a KL-divergence penalty. As recommended, we use 3 hidden layers of width 500, 500, and 2000, with an output layer dimension of 10 (https://github.com/XifengGuo/DEC-keras).
IMSAT: Proposed by Hu et al. (2017), this algorithm trains a network adversarially by generating augmented datasets. It uses 2 hidden layers of width 1200 each; the output layer size equals the number of clusters k (https://github.com/weihua916/imsat).
SN: Proposed by Shaham et al. (2018), SN uses an objective motivated by spectral clustering to map samples to a target similarity matrix (https://github.com/KlugerLab/SpectralNet).
KNet_E and KNet_S: These are the two versions of KNet described in Alg. 1, in which U is updated via eigendecomposition and Stiefel manifold ascent, respectively. For both versions, the encoder and decoder have 3 layers. For the Cancer, Wine, RCV, and Face datasets, we set the width of all hidden layers to . For Moon and Spiral1, we set the width of all hidden layers to 20. We set the Gaussian kernel parameter \sigma to the median of the pairwise Euclidean distances between samples in each dataset (see Table 2).
Evaluation Metrics. We evaluate the clustering quality of each algorithm by comparing the generated clustering assignment to the ground truth assignment via the Normalized Mutual Information (NMI). NMI is a similarity metric lying in [0, 1], with 0 denoting no similarity and 1 an identical match between the assignments. Originally recommended by Strehl and Ghosh (2002), this statistic has been widely used for clustering quality validation (Niu et al., 2011; Dang and Bailey, 2010; Wu et al., 2018; Ross and Dy, 2013). We provide a formal definition in Appendix A in the supplement.
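For reference, a small NumPy implementation of NMI under the geometric-mean normalization of Strehl and Ghosh (2002); this is a sketch, not the paper's evaluation code:

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information between two label assignments,
    NMI = I(a;b) / sqrt(H(a) H(b)): invariant to relabeling clusters."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    # joint distribution over (cluster in a, cluster in b)
    C = np.array([[np.sum((a == i) & (b == j)) for j in np.unique(b)]
                  for i in np.unique(a)]) / n
    pa, pb = C.sum(axis=1), C.sum(axis=0)          # marginals
    nz = C > 0
    I = (C[nz] * np.log(C[nz] / np.outer(pa, pb)[nz])).sum()
    Ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()
    Hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    return I / np.sqrt(Ha * Hb)

assert np.isclose(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 1.0)  # identical up to relabeling
assert nmi([0, 0, 1, 1], [0, 1, 0, 1]) < 1e-9            # independent assignments
```

Note that NMI scores a permuted but otherwise identical assignment as 1, which is why it is preferred over raw label accuracy for unsupervised evaluation.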
For each algorithm, we also measure the execution time, separating it into preprocessing time (Prep) and runtime (RT); in doing so, we separately evaluate the cost of, e.g., parameter initialization from training.
Experimental Setup. We execute all algorithms on a machine with 16 dual-core CPUs (Intel Xeon E5-2630 v3 @ 2.40GHz), 32 GB of RAM, and an NVIDIA K80 GPU. For methods that we can parallelize either over the GPU or over the 16 CPUs (IMSAT, SN, k-means, KNet), we ran both executions and recorded the fastest time. The code provided for DEC could only be parallelized over the GPU, while AEC and SC could only be executed over the CPUs. For each dataset in Table 4, we run all algorithms on the full dataset 10 times and report the mean and standard deviation of the NMI of the resulting clustering against the ground truth. As SGA is randomized, we repeat experiments 10 times and report NMI averages and standard deviations.
For algorithms that can be executed out-of-sample (AEC, DEC, IMSAT, SN, KNet), we repeat the above experiment by training the embedding only on a random subset of the dataset. Subsequently, we apply the trained embedding to the entire dataset and cluster it via k-means. For comparison, we also find cluster centers via k-means on the subset and assign new samples to the nearest cluster center. For each dataset, we set the size of the subset (reported in Table 6) so that the spectrum of the resulting subset is close to the spectrum of the full dataset.
Dataset  Data %  AEC  DEC  IMSAT  SN  k-means  KNet_E  KNet_S

Moon  25%  51.05  45.6  45.5  100  21.5  100  100 
Spiral1  10%  32.2  49.5  48.7  100  56.7  100  100 
Cancer  30%  76.4  76.9  74.9  82.2  Fails  84.0  83.6 
Wine  75%  49.1  81.5  69.4  77.2  25.0  91.1  89.3 
RCV  6%  26.78  43.23  35.2  41.3  52.5  45.2  43.1 
Face  35%  52.5  67.1  77.8  75.5  87.2  92.7  91.1 
AEC  DEC  IMSAT  SN  k-means  KNet

Data  Prep  RT  Prep  RT  Prep  RT  RT  RT  Prep  RT  RT 
Moon  15.01s  22.34s  3.35m  2.95s  1.1s  27.63s  140.1s  0.034s  48s  2.3s  1.1s 
Spiral1  3.2m  8.31m  4.17m  1.65m  47.05s  3.83m  223.3s  0.071s  82s  3.1s  1.4s 
Cancer  55.3s  1.25m  5.1m  30.01s  1.65s  1.67m  38.9s  0.03s  16s  4s  1.8s 
Wine  25.3s  35.31s  4.65m  32.31s  2.4s  25.6s  20.4s  0.03s  74s  1.3s  0.089s 
RCV  63s  5.27m  5.65m  11.21m  3.21s  56s  40.67s  0.03s  1.2m  5.5s  2.1s 
Face  15.1s  2.86m  5.25m  3.89m  1.1s  1.56m  35.09s  0.101s  76.8  2.2  0.96s 
6.1 Results
Selecting \lambda. As clustering is unsupervised, we cannot rely on ground truth labels to identify the best hyperparameter \lambda. We therefore need an unsupervised method for selecting this value. We find that, in practice, selecting \lambda = 0 works quite well. Because the problem is not convex, the local optima reached by KNet depend highly on the initialization. Initializing (a) \theta to approximate the identity map via (11), and (b) U to be the spectral embedding of X indeed leads to a local maximum that is highly dependent on the input X, eschewing the need for the reconstruction error in the objective (7); this phenomenon is also observed by Li et al. (2017). We also found that, as anticipated by (6), eliminating the second term produces more convex cluster images.

Table 3 shows the NMI of the Wine dataset for different values of \lambda; tables for additional datasets can be found in Appendix A in the supplement. We also provide the HSIC and AE reconstruction error at convergence. Beyond the good performance of \lambda = 0, the table suggests that an alternative unsupervised method is to select \lambda so that the ratio of the two terms at convergence is close to one. Of course, this comes at the cost of parameter exploration; in the remainder of this section, we report the performance of KNet with \lambda set to 0.
Comparison Against the State of the Art. Table 4 shows the NMI performance of the different algorithms over the different datasets. With the exception of the RCV dataset, KNet outperforms every algorithm in Table 4. AEC, DEC, and IMSAT perform especially poorly when encountering nonconvex clusters, as shown in the first two rows of the table. Spectral clustering (SC), and SN, which is also based on a spectral-clustering-motivated objective, perform equally well as KNet at discovering nonconvex clusters. Nevertheless, KNet outperforms them on real datasets: e.g., for the Face dataset, KNet surpasses SN by 28%. Note that, for the RCV dataset, k-means outperformed all methods, though overall performance is quite poor; a reason for this may be the poor quality of the features extracted via TF-IDF and PCA.
KNet’s ability to handle nonconvex clusters is evident in its improvement over k-means on the first two datasets. The kernel matrix shown in Fig. 1(b) illustrates why k-means performs poorly on these data. In contrast, the increasingly convex cluster images learned by KNet, as shown in Fig. 1(c), lead to much better separability. This is consistently observed for both the Moon and Spiral1 datasets, on which KNet achieves a perfect NMI; we elaborate on this further in Appendix A, demonstrating KNet’s ability to generate convex representations even when the initial representation is nonconvex.
We also note that KNet consistently outperforms spectral clustering. This is worth noting because, as discussed in Sec. 4, KNet’s initialization of both and are tied to the spectral embedding. Table 4 indicates that alternatively learning both the kernel and the corresponding spectral embedding indeed leads to improved clustering performance.
Table 5 shows the time performance of each algorithm. In terms of total time, KNet is faster than AEC and DEC. We also observe that SN is faster than most algorithms in total time. However, SN is in practice the most time-intensive algorithm, since it requires extensive hyperparameter tuning to reach the reported NMI performance (see App. A). We note that a significant percentage of the total time for KNet is spent in the preprocessing step, with KNet_S being faster than KNet_E at runtime. This is due to the initialization of \theta, i.e., training the corresponding autoencoder. Improving this initialization process could dramatically speed up the total runtime. Alternatively, as discussed in the next section, using only a small subset to train the embedding and clustering out-of-sample can also significantly accelerate the total runtime, without a considerable NMI degradation.
Out-of-Sample Clustering. We report out-of-sample clustering NMI performance in Table 6; note that SC cannot be executed out-of-sample. Each algorithm is trained using only a subset of the samples, whose size is indicated in the table. Once trained, we report in the table the clustering quality of applying each algorithm to the full set without retraining. We observe that, with the exception of RCV, KNet clearly outperforms all benchmark algorithms in terms of clustering quality. This implies that KNet is capable of generalizing its results using as little as 6% of the data.
By comparing Table 4 against Table 6, we see that AEC, DEC, and IMSAT suffer a significant quality reduction, while KNet suffers only a maximum degradation of 3%. Therefore, training on a small subset of the data not only yields high-quality results; the results are almost identical to training on the full set itself. Table 7, which reports the corresponding times, indicates that this can also lead to a significant acceleration, especially of the preprocessing step. Together, these two observations indicate that KNet can indeed be applied to the clustering of large nonconvex datasets by training the embedding on only a small subset of the provided samples.
7 Conclusions
KNet performs unsupervised kernel discovery directly from the data. By discovering a kernel that optimizes a spectral clustering objective, it simultaneously discovers an approximation of the corresponding spectral embedding through a DNN. Furthermore, our experimental results confirm that KNet can be trained using only a small subset of the data without a significant loss in clustering quality.
References
 Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ. Cited by: §5.
 The uci kdd archive of large data sets for data mining research and experimentation. ACM SIGKDD Explorations Newsletter 2 (2), pp. 81–85. Cited by: §6.
 Generation of alternative clusterings using the cami approach. In Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 118–129. Cited by: §6.
 UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §6.
 Yinyang kmeans: a dropin replacement of the classic kmeans with consistent speedup. In International Conference on Machine Learning, pp. 579–587. Cited by: §6.
 Spectral grouping using the nystrom method. IEEE transactions on pattern analysis and machine intelligence 26 (2), pp. 214–225. Cited by: §5.
 Measuring statistical dependence with hilbertschmidt norms. In International Conference on Algorithmic Learning Theory, pp. 63–77. Cited by: §1, §3.
 Deep clustering with convolutional autoencoders. In International Conference on Neural Information Processing, pp. 373–382. Cited by: §2.
 Matrix analysis. Cambridge university press. Cited by: §5.
 Learning discrete representations via information maximizing self augmented training. arXiv preprint arXiv:1702.08720. Cited by: §2, §6.
 Deep subspace clustering network. In Advances in Neural Information Processing Systems, Cited by: §2.

 MMD GAN: towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pp. 2203–2213. Cited by: §6.1.
 Cancer diagnosis via linear programming. SIAM News 23 (5), pp. 18. Cited by: §6.
 On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856. Cited by: §6.
 Dimensionality reduction for spectral clustering. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 552–560. Cited by: Appendix A, §2, §3, §6.
 Nonparametric mixture of Gaussian processes with constraints. In International Conference on Machine Learning, pp. 1346–1354. Cited by: §6.
 SpectralNet: spectral clustering using deep neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §2, §6.

 Autoencoder based data clustering. In Iberoamerican Congress on Pattern Recognition, pp. 117–124. Cited by: §2, §6.
 Supervised feature selection via dependence estimation. In Proceedings of the 24th International Conference on Machine Learning, pp. 823–830. Cited by: §3.
 Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3 (Dec), pp. 583–617. Cited by: §6.
 Learning deep representations for graph clustering. In AAAI, pp. 1293–1299. Cited by: §2.
 The variational Nyström method for large-scale spectral problems. In International Conference on Machine Learning, pp. 211–220. Cited by: §5.
 A feasible method for optimization with orthogonality constraints. Mathematical Programming 142 (1–2), pp. 397–434. Cited by: §5.
 Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pp. 2586–2594. Cited by: §2.
 Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378. Cited by: §2.
 Gaussian process regression networks. arXiv preprint arXiv:1110.4411. Cited by: §2.
 Wisconsin breast cancer dataset. University of Wisconsin Hospitals. Cited by: §6.
 Iterative spectral method for alternative clustering. In International Conference on Artificial Intelligence and Statistics, pp. 115–123. Cited by: §3, §6.
 Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487. Cited by: §2, §6.
 Learning with local and global consistency. In Advances in Neural Information Processing Systems, pp. 321–328. Cited by: §2.
Appendix A Relating HSIC to Spectral Clustering
Proof.
Using Eq. (3) to compute HSIC empirically, Eq. (4) can be rewritten as

    $\mathrm{HSIC}(X, Y) \propto \mathrm{tr}(K_X H K_Y H)$,    (17)

where $K_X$ and $K_Y$ are the kernel matrices computed from $X$ and $Y$, and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix. As shown by Niu et al. (2011), if we let $K_Y$ be a linear kernel such that $K_Y = UU^\top$, add the constraint such that $U^\top U = I$, and rotate the trace terms, we get

    $\max_U \; \mathrm{tr}(U^\top H K_X H U)$    (18a)
    s.t.: $U^\top U = I$.    (18b)

By setting the Laplacian as $L = H K_X H$, the formulation becomes identical to Spectral Clustering:

    $\max_U \; \mathrm{tr}(U^\top L U)$    (19a)
    s.t.: $U^\top U = I$.    (19b)
∎
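The trace rotation used between (17) and (18) can be checked numerically. The sketch below (all variable names and the choice of a Gaussian kernel on synthetic data are illustrative, not the paper's setup) builds a centered kernel, a random orthonormal $U$, and verifies that $\mathrm{tr}(K_X H K_Y H) = \mathrm{tr}(U^\top H K_X H U)$ when $K_Y = UU^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3

# A PSD kernel matrix K_X: Gaussian kernel on random 2-D data.
X = rng.normal(size=(n, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_X = np.exp(-sq / 2.0)

# Centering matrix H and an orthonormal U (so U^T U = I).
H = np.eye(n) - np.ones((n, n)) / n
U, _ = np.linalg.qr(rng.normal(size=(n, k)))

# Linear kernel on the relaxed cluster indicators: K_Y = U U^T.
K_Y = U @ U.T

# Cyclic property of the trace: tr(K_X H K_Y H) = tr(U^T H K_X H U).
lhs = np.trace(K_X @ H @ K_Y @ H)
rhs = np.trace(U.T @ H @ K_X @ H @ U)
assert np.isclose(lhs, rhs)
```

The identity holds for any $K_X$ and any orthonormal $U$, which is what licenses rewriting the HSIC objective in the eigenvector form of (18).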
Appendix B Effect of the loss weighting parameter on HSIC, AE reconstruction error, and NMI
        KNet                          KNet
        HSIC    AE error   NMI       HSIC    AE error   NMI
        2.989   0.001      0.446     2.989   0          0.421
        2.989   0.0001     0.412     2.989   0          0.414
        2.989   0.0001     0.421     2.989   0          0.418
        2.989   0.001      0.434     2.989   0          0.419
        2.988   0.001      0.421     2.988   0          0.418
        2.965   0.099      0.557     2.979   0.033      0.558
        1.982   0.652      0.644     2.33    0.583      0.56
        2.124   0.669      0.969     2.037   0.66       0.824
        2.249   0.69       1         2.368   0.67       1
  0     2.278   0.715      1         2.121   0.718      1
Appendix C Normalized Mutual Information
Consider two clustering assignments assigning labels in $\{1, \dots, k\}$ to the $n$ samples in dataset $X$. We represent these two assignments through two partitions of $X$, namely $\{A_i\}_{i=1}^{k}$ and $\{B_j\}_{j=1}^{k}$, where $A_i$ and $B_j$ are the sets of samples receiving label $i$ (respectively, $j$) under each assignment. Define the empirical distribution of labels to be

    $p(i, j) = \frac{|A_i \cap B_j|}{n}$.    (20)

The NMI is then given by the ratio

    $\mathrm{NMI} = \frac{I(p)}{\sqrt{H_A H_B}}$,

where $I$ is the mutual information and $H_A$, $H_B$ are the entropies of the marginals of the joint distribution (20).
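A minimal sketch of this computation follows, assuming the square-root normalization of the entropies (the function name and test labels are illustrative):

```python
import numpy as np

def nmi(a, b):
    """NMI of two label assignments, via the empirical joint distribution (20)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    la, lb = np.unique(a), np.unique(b)
    # p(i, j) = |A_i ∩ B_j| / n.
    p = np.array([[np.sum((a == i) & (b == j)) for j in lb] for i in la]) / n
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    # Mutual information and marginal entropies (with 0 log 0 := 0).
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    return mi / np.sqrt(ha * hb)

# Identical assignments (up to relabeling) give NMI = 1.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0 (up to floating point)
```

Note that NMI is invariant to permutations of the label names, which is why it is a standard metric for comparing a clustering against ground truth.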
Appendix D Moon and Spiral datasets
In this appendix, we illustrate that for the Moon and Spiral datasets, we are able to learn (a) convex images for the clusters and (b) kernels that produce block-diagonal structures. The kernel matrix is constructed with a Gaussian kernel, so its values lie between 0 and 1. In the kernel matrices shown in the figures below, white denotes 0 and dark blue denotes 1; all values in between are shown as a gradient between the two colors.
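As a small illustration of how these kernel matrices are formed, the sketch below (the bandwidth value and data are assumed, purely for illustration) builds a Gaussian kernel matrix and checks the properties that make the white-to-dark-blue rendering possible:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # placeholder 2-D data

# Pairwise squared distances and a Gaussian kernel with bandwidth sigma.
sigma = 1.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))

# Every entry lies in (0, 1], with 1s on the diagonal, so K can be
# rendered directly as a gradient from white (0) to dark blue (1).
assert np.all((K > 0) & (K <= 1))
assert np.allclose(np.diag(K), 1.0)
```

A block-diagonal structure appears when the rows are ordered by cluster and within-cluster distances are much smaller than between-cluster distances, which is exactly what the learned embedding encourages.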
In Figure 4, the Moon dataset is plotted in Fig. 4(a) and its kernel block structure in Fig. 4(b). After training the embedding, the image of the dataset is shown in Fig. 4(c), along with its block-diagonal structure in Fig. 4(d). Using the same embedding trained on the clean data, we distorted the dataset with Gaussian noise and plot it in Fig. 5(a), along with its kernel matrix in Fig. 5(b). We then pass the distorted dataset through the embedding and plot the resulting image in Fig. 5(c), along with its kernel matrix in Fig. 5(d). This example demonstrates KNet's ability to embed data into convex clusters even under Gaussian noise.
In Figure 6, a subset of 300 samples of the Spiral dataset is plotted in Fig. 6(a) and its kernel block structure in Fig. 6(b). After training the embedding, the image of the subset is shown in Fig. 6(c), along with its block-diagonal structure in Fig. 6(d). The full dataset is shown in Fig. 7(a), along with its kernel matrix in Fig. 7(b). Using the same embedding trained on the subset of Fig. 6(a), we pass the full dataset through it and plot the resulting image in Fig. 7(c), along with its kernel matrix in Fig. 7(d). This example demonstrates KNet's ability to generalize the convex embedding after training on only 1% of the data.
Appendix E Algorithm Hyperparameter Details
Algorithm  Learning Rate  Batch size  Optimizer 

AEC    100  SGD 
DEC  0.01  256  SGDSolver 
IMSAT  0.002  250  Adam 
SN  0.001  128  RMSProp 
KNet  0.001  5  Adam 
The hyperparameters for each network are set as outlined in the respective papers. In the case of SN, the hyperparameters included the number of neighbors used in the affinity calculation, the number of neighbors used for the graph Laplacian affinity matrix, the number of neighbors used to calculate the scale of the Gaussian graph Laplacian, and the threshold for calculating the closest neighbors in the Siamese network. They were set by a grid search over values ranging from 2 to 10, using the loss over a portion of the training data.
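The grid search described above can be sketched as follows. The loss function below is a placeholder standing in for training SN and evaluating held-out loss, and all parameter names are illustrative, not SN's actual API:

```python
from itertools import product

# Search range 2..10 for each neighbor-count hyperparameter, as in the text.
grid = range(2, 11)

def validation_loss(n_affinity, n_scale, n_siamese):
    # Placeholder: in the real search, train SN with these settings
    # and return the loss measured on held-out training data.
    return abs(n_affinity - 5) + abs(n_scale - 7) + abs(n_siamese - 3)

# Exhaustive grid search: pick the setting minimizing the validation loss.
best = min(product(grid, repeat=3), key=lambda p: validation_loss(*p))
print(best)  # -> (5, 7, 3) for this placeholder loss
```

With a 9-value grid and three parameters this is 729 trainings; in practice one would prune the grid or search parameters coordinate-wise if a full sweep is too expensive.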