1 Introduction
Discovering clusters in unlabeled data is a task of significant scientific and practical value. As technology progresses, images, texts, and other types of data are acquired in large numbers, but labeling them is often expensive, tedious, or requires expert knowledge. Clustering techniques provide useful tools to analyze such data and to reveal their underlying structure.
Spectral clustering (Shi & Malik, 2000; Ng et al., 2002; Von Luxburg, 2007) is a leading and highly popular clustering algorithm. It works by embedding the data in the eigenspace of the graph Laplacian matrix, derived from the pairwise similarities between data points, and applying k-means to this representation to obtain the clusters. Several properties make spectral clustering appealing. First, its embedding optimizes a natural cost function, minimizing pairwise distances between similar data points; moreover, this optimal embedding can be found analytically. Second, spectral clustering variants arise as relaxations of graph balanced-cut problems (Von Luxburg, 2007). Third, spectral clustering was shown to outperform other popular clustering algorithms such as k-means (Von Luxburg, 2007), arguably due to its ability to handle non-convex clusters. Finally, spectral clustering has a solid probabilistic interpretation, since the Euclidean distance in the embedding space equals a diffusion distance, which, informally, measures the time it takes probability mass to transfer between points, via other points in the dataset (Nadler et al., 2006; Coifman & Lafon, 2006a).
While the spectral embedding of data points can be obtained by a simple eigendecomposition of their graph Laplacian matrix, for large datasets direct computation of the eigenvectors may be prohibitive. Moreover, generalizing a spectral embedding to unseen data points, a task commonly referred to as out-of-sample extension (OOSE), is non-trivial; see, for example, (Bengio et al., 2004; Fowlkes et al., 2004; Coifman & Lafon, 2006b).
In this work we introduce SpectralNet, a deep learning approach to spectral clustering that addresses the scalability and OOSE problems noted above. Specifically, SpectralNet is trained in a stochastic fashion, which allows it to scale. Moreover, once trained, it provides a function, implemented as a feed-forward network, that maps each input data point to its spectral embedding coordinates; this map can easily be applied to new test data. Unlike standard deep learning models, SpectralNet is trained using constrained optimization, where the constraint (orthogonality of the net outputs) is enforced by adding a linear layer whose weights are set by the QR decomposition of its inputs. In addition, as good affinity functions are crucial for the success of spectral clustering, rather than using the common Euclidean distance to compute Gaussian affinities, we show how Siamese networks can be trained from unlabeled data to learn pairwise distances, and consequently significantly improve the quality of the clustering. Further improvement can be achieved if our network is applied to transformed data obtained by an autoencoder (AE). On the theoretical front, we use VC-dimension theory to derive a lower bound on the size of neural networks that compute spectral clustering. Our experiments indicate that our network indeed approximates the Laplacian eigenvectors well, allowing it to cluster challenging non-convex point sets that recent deep network-based methods fail to handle; see examples in Figure 1. Finally, SpectralNet achieves competitive performance on the MNIST handwritten digit dataset and state-of-the-art performance on the Reuters document dataset, whose size makes standard spectral clustering inapplicable.
2 Related work
Recent deep learning approaches to clustering largely attempt to learn a code for the input that is amenable to clustering according to either the k-means or mixture-of-Gaussians clustering models. DCN (Yang et al., 2017) directly optimizes a loss composed of a reconstruction term (for the code) and the k-means functional. DEC (Xie et al., 2016) iteratively updates a target distribution to sharpen cluster associations. DEPICT (Dizaji et al., 2017) adds a regularization term that prefers balanced clusters. All three methods are pretrained as autoencoders, while DEPICT also initializes its target distribution using k-means (or other standard clustering algorithms). Several other recent approaches rely on a variational autoencoder that utilizes a Gaussian mixture prior; see, for example, VaDE (Zheng et al., 2016) and GMVAE (Dilokthanakul et al., 2016). IMSAT (Hu et al., 2017) is based on data augmentation, where the net is trained to maximize the mutual information between inputs and predicted clusters, while regularizing the net so that the cluster assignment of original data points is consistent with the assignment of augmented points. Different approaches are proposed by Chen (2015), who uses a deep belief net followed by non-parametric maximum margin clustering (NMMC), and Yang et al. (2016), who introduce a recurrent-agglomerative framework for image clustering.
While these approaches achieve accurate clustering results on standard datasets (such as MNIST and Reuters), the use of the k-means criterion, as well as of the Gaussian mixture prior, seems to introduce an implicit bias towards the formation of clusters with convex shapes; this limitation appears to hold even in code space. The bias is demonstrated in Figure 1 (bottom), which shows the failure of several of the above approaches on relatively simple clustering tasks. In contrast, as indicated in Figure 1 (top), our SpectralNet approach appears to be less vulnerable to such bias. The full set of runs can be found in Appendix A.
In the context of spectral clustering, Tian et al. (2014) learn an autoencoder that maps the rows of a graph Laplacian matrix onto the corresponding spectral embedding, and then use k-means in code space to cluster the underlying data. Unlike our work, which learns to map an input data point to its spectral embedding, Tian et al.'s network takes as input an entire row of the graph Laplacian; OOSE is therefore impractical, as it requires computing, as preprocessing, the affinities of each new data point to all the training data. Also of interest is a kernel spectral method by Alzate & Suykens (2010), which allows for out-of-sample extension and handles large datasets through smart sampling (but does not use a neural network).
Yi et al. (2016) address the problem of 3D shape segmentation. Their work, which focuses on learning graph convolutions, uses a graph spectral embedding obtained through eigenvector decomposition, which is not learned. In addition, we enforce orthogonality stochastically through a constraint layer, while they attempt to learn orthogonalized functional maps by adding an orthogonalization term to the loss function, which involves non-trivial balancing between two loss components.
Other deep learning works use a spectral approach in the context of supervised learning. Mishne et al. (2017) trained a network to map graph Laplacian matrices to their eigenvectors using supervised regression. Their approach, however, requires the true eigenvectors for training, and hence does not easily scale to large datasets. Law et al. (2017) apply supervised metric learning, showing that their method approximates the eigenvectors of a 0-1 affinity matrix constructed from the true labels. A related approach is taken by Hershey et al. (2016), whose net learns an embedding of the data in which the dot-product affinity is similar to the affinity obtained from the true labels. Finally, a number of papers showed that stochastic gradient descent can be used effectively to compute eigendecompositions.
Han & Filippone (2017) apply this to spectral clustering. Unlike SpectralNet, however, their method does not compute the eigenvector embedding as a function of the data, and so out-of-sample extension is not possible. Shamir (2015) uses SGD to compute the principal components of covariance matrices (see also references therein). Their setup assumes that in each iteration a noisy estimate of the entire input matrix is provided. In contrast, in our work we use in each iteration only a small submatrix of the affinity matrix, corresponding to a small minibatch. In future work, we plan to examine how these algorithms can be adapted to improve the convergence rate of our proposed network.
3 SpectralNet
In this section we present our proposed approach, describe its key components, and explain its connection to spectral clustering. Consider the following standard clustering setup: let $\mathcal{X} = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ denote a collection of unlabeled data points drawn from some unknown distribution $\mathcal{D}$; given a target number of clusters $k$ and a distance measure between points, the goal is to learn a similarity measure between points in $\mathcal{X}$ and use it to learn a map that assigns each of $x_1, \ldots, x_n$ to one of $k$ possible clusters, so that similar points tend to be grouped in the same cluster. As in classification tasks, we further aim to use the learned map to determine the cluster assignments of new, yet unseen, points drawn from $\mathcal{D}$. Such out-of-sample extension is based solely on the learned map, and requires neither computation of similarities between the new points and the training points nor re-clustering of the combined data.
In this work we propose SpectralNet, a neural network approach to spectral clustering. Once trained, SpectralNet computes a map $F_\theta : \mathbb{R}^d \to \mathbb{R}^k$ and a cluster assignment function. It maps each input point $x$ to an output $y = F_\theta(x)$ and provides its cluster assignment. The spectral map $F_\theta$ is implemented using a neural network, and the parameter vector $\theta$ denotes the network weights.
The training of SpectralNet consists of three components: (i) unsupervised learning of an affinity function, given the input distance measure, via a Siamese network (see Section 3.2); (ii) unsupervised learning of the map $F_\theta$ by optimizing a spectral clustering objective while enforcing orthogonality (see Section 3.1); and (iii) learning the cluster assignments, by k-means clustering in the embedded space.
3.1 Learning the Spectral Map
In this section we describe the main learning step in SpectralNet, component (ii) above. To this end, let $w : \mathbb{R}^d \times \mathbb{R}^d \to [0, \infty)$ be a symmetric affinity function, such that $w(x, x')$ expresses the similarity between $x$ and $x'$. Given $w$, we would like points that are similar to each other (i.e., with large $w(x, x')$) to be embedded close to each other. Hence, we define the loss
$$L_{\text{SpectralNet}}(\theta) = \mathbb{E}\left[ w(x, x')\, \left\| y - y' \right\|^2 \right], \qquad (1)$$
where $y = F_\theta(x)$ and $y' = F_\theta(x')$, the expectation is taken with respect to pairs of i.i.d. elements $(x, x')$ drawn from $\mathcal{D}$, and $\theta$ denotes the parameters of the map $F_\theta$. Clearly, the loss (1) can be minimized trivially by mapping all points to the same output vector ($F_\theta(x) = y_0$ for all $x$). To prevent this, we require the outputs to be orthonormal in expectation with respect to $\mathcal{D}$, i.e.,
$$\mathbb{E}\left[ y\, y^T \right] = I_{k \times k}. \qquad (2)$$
As the distribution $\mathcal{D}$ is unknown, we replace the expectations in (1) and (2) by their empirical analogues. Furthermore, we perform the optimization in a stochastic fashion. Specifically, at each iteration we randomly sample a minibatch of $m$ samples, which without loss of generality we denote $x_1, \ldots, x_m$, and organize them in an $m \times d$ matrix $X$ whose $i$th row contains $x_i^T$. We then minimize the loss
$$L_{\text{SpectralNet}}(\theta) = \frac{1}{m^2} \sum_{i,j=1}^{m} W_{i,j} \left\| y_i - y_j \right\|^2, \qquad (3)$$
where $y_i = F_\theta(x_i)$ and $W$ is an $m \times m$ matrix such that $W_{i,j} = w(x_i, x_j)$. The analogue of (2) for a small minibatch is
$$\frac{1}{m}\, Y^T Y = I_{k \times k}, \qquad (4)$$
where $Y$ is the $m \times k$ matrix of outputs whose $i$th row is $y_i^T$.
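To make (3) and (4) concrete, the following is a minimal NumPy sketch of the minibatch loss and the orthogonality residual (our own naming, not the authors' implementation):

```python
import numpy as np

def spectralnet_loss(Y, W):
    """Minibatch loss (3): (1 / m^2) * sum_{i,j} W_ij * ||y_i - y_j||^2."""
    m = Y.shape[0]
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # (m, m)
    return (W * sq_dists).sum() / m ** 2

def orthogonality_residual(Y):
    """Deviation of the outputs from constraint (4): (1/m) Y^T Y = I."""
    m, k = Y.shape
    return np.linalg.norm(Y.T @ Y / m - np.eye(k))
```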
We implement the map $F_\theta$ as a general neural network whose last layer enforces the orthogonality constraint (4). This layer gets input from $k$ units, and acts as a linear layer with $k$ outputs, whose weights are set to orthogonalize the output for the minibatch $X$. Let $\tilde{Y}$ denote the $m \times k$ matrix containing the inputs to this layer for $x_1, \ldots, x_m$ (i.e., the outputs of $F_\theta$ over the minibatch, before orthogonalization); a linear map that orthogonalizes the columns of $\tilde{Y}$ is computed through its QR decomposition. Specifically, for any matrix $A$ such that $A^T A$ is full rank, one can obtain the QR decomposition via the Cholesky decomposition $A^T A = L L^T$, where $L$ is a lower triangular matrix, by setting $Q = A \left( L^{-1} \right)^T$ and $R = L^T$. This is verified in Appendix B. Therefore, in order to orthogonalize $\tilde{Y}$, the last layer multiplies $\tilde{Y}$ from the right by $\sqrt{m}\left( L^{-1} \right)^T$, where $L$ is obtained from the Cholesky decomposition of $\tilde{Y}^T \tilde{Y}$, and the factor $\sqrt{m}$ is needed to satisfy (4).
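A sketch of the weight computation for this orthogonalization layer (NumPy; a minimal illustration under our own naming):

```python
import numpy as np

def orthonorm_weights(Y_tilde):
    """Last-layer weights sqrt(m) * (L^{-1})^T, where
    Y_tilde^T Y_tilde = L L^T is the Cholesky decomposition."""
    m = Y_tilde.shape[0]
    L = np.linalg.cholesky(Y_tilde.T @ Y_tilde)
    return np.sqrt(m) * np.linalg.inv(L).T

# After applying the layer, Y = Y_tilde @ orthonorm_weights(Y_tilde)
# satisfies (1/m) * Y.T @ Y == I_k up to numerical error.
```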
We train the spectral map $F_\theta$ in a coordinate descent fashion, alternating between orthogonalization and gradient steps. Each of these steps uses a different minibatch (possibly of different sizes), sampled uniformly from the training set. In each orthogonalization step we use the QR decomposition to set the weights of the last layer. In each gradient step we tune the remaining weights using standard backpropagation. Once SpectralNet is trained, all the weights are frozen, including those of the last layer, which simply acts as a linear layer. Finally, to obtain the cluster assignments, we propagate $x_1, \ldots, x_n$ through the trained network to obtain their embeddings $y_1, \ldots, y_n$, and perform k-means on them, obtaining $k$ cluster centers, as in standard spectral clustering. These algorithmic steps are summarized in Algorithm 1 in Section 3.3.
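The alternating scheme can be sketched as follows; this is an illustrative PyTorch-style loop under our own naming and simplifications (the orthogonalization weights are kept as an explicit matrix rather than a network layer), not the authors' implementation:

```python
import torch

def spectral_loss(Y, W):
    # Loss (3) on a minibatch.
    m = Y.shape[0]
    return (W * torch.cdist(Y, Y) ** 2).sum() / m ** 2

def train_step(net, opt, x_ortho, x_grad, W_grad):
    """One round of coordinate descent: an orthogonalization step on one
    minibatch, then a gradient step on another."""
    with torch.no_grad():                        # orthogonalization step
        Y_tilde = net(x_ortho)
        m = Y_tilde.shape[0]
        L = torch.linalg.cholesky(Y_tilde.T @ Y_tilde)
        A = (m ** 0.5) * torch.linalg.inv(L).T   # last-layer weights
    Y = net(x_grad) @ A                          # gradient step on the rest
    loss = spectral_loss(Y, W_grad)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```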
Connection with spectral clustering. The loss (3) can also be written as
$$L_{\text{SpectralNet}}(\theta) = \frac{2}{m^2}\, \mathrm{trace}\left( Y^T (D - W)\, Y \right),$$
where $D$ is a diagonal matrix with $D_{i,i} = \sum_{j} W_{i,j}$. The symmetric, positive semidefinite matrix $D - W$ forms the (unnormalized) graph Laplacian of the minibatch $x_1, \ldots, x_m$. For $k = 1$, the loss is minimized when $y$ is the eigenvector of $D - W$ corresponding to the smallest eigenvalue. Similarly, for general $k$, under the constraint (4), the minimum is attained when the column space of $Y$ is the subspace spanned by the eigenvectors corresponding to the $k$ smallest eigenvalues of $D - W$. Note that this subspace includes the constant vector, whose inclusion does not affect the final cluster assignment. Hence, SpectralNet approximates spectral clustering, where the main differences are that the training is done in a stochastic fashion, and that the orthogonality constraint with respect to the full dataset holds only approximately. SpectralNet therefore trades accuracy for scalability and generalization ability. Specifically, while its outputs are an approximation of the true eigenvectors, the stochastic training enables its scalability and thus allows one to cluster large datasets that are prohibitive for standard spectral clustering. Moreover, once trained, SpectralNet provides a parametric function whose image of the training points approximates the eigenvectors of the graph Laplacian. This function can naturally embed new test points, which were not present at training time. Our experiments with the MNIST dataset (Section 5) indicate that the outputs of SpectralNet closely approximate the true eigenvectors.
Finally, as in common spectral clustering applications, cluster assignments are determined by applying k-means to the embeddings $y_1, \ldots, y_n$. We note that the k-means step can be replaced by other clustering algorithms. Our preference for k-means is based on the interpretation (for normalized Laplacian matrices) of the Euclidean distance in the embedding space as a diffusion distance in the input space (Nadler et al., 2006; Coifman & Lafon, 2006a).
Normalized graph Laplacian. In spectral clustering, the symmetric normalized graph Laplacian $D^{-1/2}(D - W)D^{-1/2}$ can be used as an alternative to the unnormalized Laplacian $D - W$. In order to train SpectralNet with the normalized graph Laplacian, the loss function (3) should be replaced by
$$L_{\text{SpectralNet}}(\theta) = \frac{1}{m^2} \sum_{i,j=1}^{m} W_{i,j} \left\| \frac{y_i}{\sqrt{d_i}} - \frac{y_j}{\sqrt{d_j}} \right\|^2, \qquad (5)$$
where $d_i = \sum_{j=1}^{m} W_{i,j}$.
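A sketch of the normalized variant (5), under the same assumptions as the earlier snippets:

```python
import numpy as np

def spectralnet_loss_normalized(Y, W):
    """Loss (5): as (3), but with each row y_i scaled by 1 / sqrt(d_i)."""
    m = Y.shape[0]
    d = W.sum(axis=1)                                  # degrees d_i
    Yn = Y / np.sqrt(d)[:, None]
    sq_dists = np.sum((Yn[:, None, :] - Yn[None, :, :]) ** 2, axis=-1)
    return (W * sq_dists).sum() / m ** 2
```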
Batch size considerations. In standard classification or regression loss functions, the loss is a sum over the losses of individual examples. In contrast, the SpectralNet loss (3) is a sum over pairs of points, where each summand describes the relationship between two data points. This relation is encoded by the full $n \times n$ affinity matrix (which we never compute explicitly). The minibatch size $m$ should therefore be large enough to capture the structure of the data. For the same reason, it is important that minibatches be sampled at random from the entire dataset at each iteration step, rather than fixed across epochs. When the minibatches are fixed, the knowledge of the affinity matrix is reduced to a (possibly permuted) diagonal sequence of blocks, ignoring many of its entries. In addition, while the output layer orthogonalizes the outputs for the minibatch used in the orthogonalization step, we have no guarantees on how well it orthogonalizes other random minibatches. However, in our experiments we observed that if $m$ is large enough, the layer approximately orthogonalizes other batches as well, and its weights stabilize as training progresses. Therefore, to train SpectralNet, we use larger minibatches than is common in classification; in our experiments we used minibatches of size 1024 for MNIST and 2048 for Reuters, resampled randomly at every iteration step.
3.2 Learning affinities using a Siamese network
Choosing a good affinity measure is crucial to the success of spectral clustering. In many applications, practitioners use an affinity measure that is positive for a set of nearest-neighbor pairs, combined with a Gaussian kernel with some scale $\sigma > 0$, e.g.,
$$W_{i,j} = \begin{cases} \exp\left( -\dfrac{\|x_i - x_j\|^2}{2\sigma^2} \right), & x_j \text{ is among the nearest neighbors of } x_i, \\ 0, & \text{otherwise,} \end{cases} \qquad (6)$$
where one typically symmetrizes $W$, for example, by setting $W_{i,j} \leftarrow \frac{1}{2}\left( W_{i,j} + W_{j,i} \right)$.
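For illustration, a nearest-neighbor Gaussian affinity of the form (6) can be constructed as follows (a sketch assuming NumPy and scikit-learn; the parameter names and the default scale rule are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_gaussian_affinity(X, n_neighbors=25, sigma=None):
    """Gaussian affinity (6), nonzero only on nearest-neighbor pairs,
    symmetrized as (W + W^T) / 2."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    dists, idx = nn.kneighbors(X)          # column 0 is the point itself
    if sigma is None:
        sigma = np.median(dists[:, -1])    # e.g., median k-th neighbor distance
    n = X.shape[0]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    W[rows, idx[:, 1:].ravel()] = np.exp(-dists[:, 1:].ravel() ** 2
                                         / (2 * sigma ** 2))
    return (W + W.T) / 2
```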
Euclidean distance may be an overly simplistic measure of similarity; seeking methods that can capture more complex similarity relations might therefore be advantageous. Siamese nets (Hadsell et al., 2006; Shaham & Lederman, 2018) are trained to learn affinity relations between data points; we empirically found that unsupervised application of a Siamese net to determine the distances often improves the quality of the clustering.
Siamese nets are typically trained on a collection of similar (positive) and dissimilar (negative) pairs of data points. When labeled data are available, such pairs can be chosen based on label information (i.e., pairs of points with the same label are considered positive, while pairs of points with different labels are considered negative). Here we focus on datasets that are unlabeled. In this case we can learn the affinities directly from Euclidean proximity or from graph distance, e.g., by "labeling" a pair of points positive if the distance between them is small and negative otherwise. In our experiments, we construct positive pairs from the nearest neighbors of each point; negative pairs are constructed from points with larger distances. The Siamese network, therefore, is trained to learn an adaptive nearest-neighbor metric.
A Siamese net maps every data point $x_i$ into an embedding $z_i = G_\theta(x_i)$ in some space. The net is typically trained to minimize the contrastive loss, defined for a pair $(x_i, x_j)$ as
$$L_{\text{Siamese}}(\theta; x_i, x_j) = \begin{cases} \|z_i - z_j\|^2, & (x_i, x_j) \text{ is a positive pair,} \\ \max\left( c - \|z_i - z_j\|,\, 0 \right)^2, & (x_i, x_j) \text{ is a negative pair,} \end{cases}$$
where $c$ is a margin (typically set to 1).
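A minimal sketch of this loss for a single pair (our naming):

```python
import numpy as np

def contrastive_loss(z_i, z_j, is_positive, margin=1.0):
    """Contrastive loss of Hadsell et al. (2006) for one pair of embeddings."""
    dist = np.linalg.norm(z_i - z_j)
    if is_positive:
        return dist ** 2                        # pull positive pairs together
    return max(margin - dist, 0.0) ** 2         # push negative pairs apart
```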
Once the Siamese net is trained, we use it to define a batch affinity matrix for the training of SpectralNet, by replacing the Euclidean distance $\|x_i - x_j\|$ in (6) with the Siamese distance $\|z_i - z_j\|$.
Remarkably, despite being trained in an unsupervised fashion on a training set constructed from relatively naive nearest neighbor relations, in Section 5 we show that affinities that use the Siamese distances yield dramatically improved clustering quality over affinities that use Euclidean distances. This implies that unsupervised training of Siamese nets can lead to learning useful and rich affinity relations.
3.3 Algorithm
Our end-to-end training approach is summarized in Algorithm 1.
Once SpectralNet is trained, computing the embeddings of new test points (i.e., out-of-sample extension) and their cluster assignments is straightforward: we simply propagate each test point $x$ through the network to obtain its embedding $y = F_\theta(x)$, and assign the point to the nearest centroid, where the centroids were computed by k-means on the training data in the last line of Algorithm 1.
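Out-of-sample extension thus reduces to a forward pass and a nearest-centroid lookup; a sketch (our naming):

```python
import numpy as np

def predict_clusters(X_test, spectral_map, centroids):
    """spectral_map implements the trained F_theta ((n, d) -> (n, k));
    centroids are the k-means centers computed on the training embeddings."""
    Y = spectral_map(X_test)                                    # embeddings
    d = np.linalg.norm(Y[:, None, :] - centroids[None], axis=-1)
    return d.argmin(axis=1)                                     # assignments
```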
3.4 Spectral clustering in code space
Given a dataset $\mathcal{X}$, one can apply SpectralNet either in the original input space or in a code space (obtained, for example, by an autoencoder). A code space representation is typically lower dimensional and often contains less nuisance information (i.e., information on which an appropriate similarity measure should not depend). Following (Yang et al., 2017; Xie et al., 2016; Zheng et al., 2016) and others, we empirically observed that SpectralNet performs best in code space. Unlike these works, which use an autoencoder as an initialization for their clustering networks, we use the code as our data representation and apply SpectralNet directly in that space (i.e., we do not change the code space while training SpectralNet). In our experiments, we use code spaces obtained from publicly available autoencoders trained by Zheng et al. (2016) on the MNIST and Reuters datasets.
4 Theoretical analysis
Our proposed SpectralNet not only determines cluster assignments during training, as clustering algorithms commonly do, but also produces a map that can generalize to unseen data points at test time. Given a training set of $n$ points, it is thus natural to ask how large such a network should be in order to represent this spectral map. The theory of VC dimension can provide useful worst-case bounds for this size.
In this section, we use VC-dimension theory to study the minimal size a neural network should have in order to compute spectral clustering for $n$ points. Specifically, we consider the class of functions that map all training points to binary values, determined by thresholding at zero the eigenvector of the graph Laplacian with the second-smallest eigenvalue. We denote this class of binary classifiers $\mathcal{H}_n$. Note that with $k = 2$, k-means can be replaced by thresholding the second-smallest eigenvector, albeit not necessarily at zero. We are interested in the minimal number of weights and neurons required to allow the net to compute such functions, assuming the affinities decay exponentially with the Euclidean distance. We do so by studying the VC dimension of the function classes obtained by performing spectral clustering on $n$ points in arbitrary Euclidean spaces $\mathbb{R}^d$, with $d \ge 3$. We make no assumption on the underlying distribution of the points.
In the main result of this section, we prove a lower bound on the VC dimension of spectral clustering that is linear in the number of points $n$. In contrast, the VC dimension of k-means, for example, depends solely on the dimension $d$, but not on $n$, making k-means significantly less expressive than spectral clustering.¹ As a consequence of our main theorem, we bound from below the number of weights and neurons in any net that is required to compute Laplacian eigenvectors. The reader might find the analysis in this section interesting in its own right.
¹For two clusters in $\mathbb{R}^d$, k-means clustering partitions the data using a linear separation. It is well known that the VC dimension of the class of linear classifiers in $\mathbb{R}^d$ is $d + 1$. Hence, k-means can shatter at most $d + 1$ points in $\mathbb{R}^d$, regardless of the size of the dataset.
Our main result shows that for data in $\mathbb{R}^d$ with $d \ge 3$, the VC dimension of $\mathcal{H}_n$ is linear in the number of points, making spectral clustering almost as rich as arbitrary clustering of the points.
Theorem 4.1.
$$\mathrm{VCdim}(\mathcal{H}_n) \ge \frac{n}{10}.$$
The formal proof of Theorem 4.1 is deferred to Appendix C. Below we informally sketch its principles. We want to show that for any integer $n$ (assuming for simplicity that $n$ is divisible by 10), there exists a set of $\frac{n}{10}$ points in $\mathbb{R}^d$ that is shattered by $\mathcal{H}_n$. In particular, we show this for a set of $\frac{n}{10}$ points placed on a 2-dimensional grid in $\mathbb{R}^d$. We then show that for any arbitrary dichotomy of these points, we can augment the set to a larger set $V$, containing $n$ points, with a balanced partition of $V$ into two disjoint sets $A$ and $B$ that respects the dichotomy of the original points. The larger set has two special properties: (1) within $A$ (and respectively within $B$), there is a path between any two points such that the distances between all pairs of consecutive points along the path are small, and (2) all pairs $(a, b) \in A \times B$ are far apart. We complete the proof by constructing a Gaussian affinity with a suitable value of the scale $\sigma$ and showing that the minimizer of the spectral clustering loss for $V$ (i.e., the second eigenvector of the Laplacian), when thresholded at 0, separates $A$ from $B$, and hence respects the original dichotomy.
By connecting Theorem 4.1 with known results regarding the VC dimension of neural nets, see, e.g., (Shalev-Shwartz & Ben-David, 2014), we can bound from below the size (in terms of the number of weights and neurons) of any neural net that computes spectral clustering. This is formalized in the following corollary.
Corollary 4.2.

1. For the class of neural nets with sigmoid nodes, $u$ units, and $w$ weights to represent all functions realizable by spectral clustering (i.e., the second eigenvector of the Laplacian, thresholded at 0) on $n$ points, it is necessary to have $u^2 w^2 = \Omega(n)$.

2. For the class of neural nets with $w$ weights from a finite family (e.g., single-precision weights) to represent all functions realizable by spectral clustering, it is necessary to have $w = \Omega(n)$.
Proof.

1. The VC dimension of the class of neural nets with sigmoid activations, $u$ units, and $w$ weights is at most $O(u^2 w^2)$ (see (Shalev-Shwartz & Ben-David, 2014), p. 275). Hence, if $u^2 w^2 = o(n)$, such a net cannot shatter any collection of $\frac{n}{10}$ points. From Theorem 4.1, $\mathcal{H}_n$ shatters at least $\frac{n}{10}$ points. Therefore, in order for a class of networks to be able to express any function that can be computed using spectral clustering, it is a necessary (but not sufficient) condition that $u^2 w^2 = \Omega(n)$.

2. The VC dimension of the class of neural nets with $w$ weights from a finite family is $O(w)$ (see (Shalev-Shwartz & Ben-David, 2014), p. 276). The arguments above imply that $w = \Omega(n)$. ∎
Corollary 4.2 implies that in the general case (i.e., without assuming any structure on the data points), to perform spectral clustering, the size of the net has to grow with $n$. However, when the data has some geometric structure, the net size can be much smaller. Indeed, in a related result, the ability of neural networks to learn eigenvectors of Laplacian matrices was demonstrated both empirically and theoretically by Mishne et al. (2017). They proved that there exist networks that approximate the eigenfunctions of manifold Laplacians arbitrarily well (where the size of the network depends on the desired error and on the parameters of the manifold, but not on $n$).
5 Experimental results
5.1 Evaluation metrics
To numerically evaluate the accuracy of the clustering, we use two commonly used measures: the unsupervised clustering accuracy (ACC) and the normalized mutual information (NMI). For completeness, we define ACC and NMI below, and refer the reader to (Cai et al., 2011) for more details. For a data point $x_i$, let $l_i$ and $c_i$ denote its true label and predicted cluster, respectively, and let $l = (l_1, \ldots, l_n)$ and similarly $c = (c_1, \ldots, c_n)$.
ACC is defined as
$$\mathrm{ACC}(l, c) = \max_{\pi \in \Pi_k} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left\{ \pi(c_i) = l_i \right\},$$
where $\Pi_k$ is the collection of all permutations of $\{1, \ldots, k\}$. The optimal permutation $\pi$ can be computed using the Kuhn-Munkres algorithm (Munkres, 1957).
NMI is defined as
$$\mathrm{NMI}(l, c) = \frac{I(l; c)}{\max\{H(l), H(c)\}},$$
where $I(l; c)$ denotes the mutual information between $l$ and $c$, and $H(\cdot)$ denotes entropy. Both ACC and NMI take values in $[0, 1]$, with higher values indicating better correspondence between the clusters and the true labels.
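Both metrics are straightforward to compute; a sketch assuming NumPy, SciPy (whose `linear_sum_assignment` implements the Kuhn-Munkres step), and scikit-learn:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(labels, clusters):
    """ACC: accuracy under the best permutation of cluster indices."""
    k = int(max(labels.max(), clusters.max())) + 1
    counts = np.zeros((k, k), dtype=int)
    for l, c in zip(labels, clusters):
        counts[c, l] += 1
    row, col = linear_sum_assignment(-counts)   # maximize matched counts
    return counts[row, col].sum() / labels.size

# NMI is available directly as
# normalized_mutual_info_score(labels, clusters).
```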
5.2 Clustering
We compare SpectralNet to several deep learning-based clustering approaches on two real-world datasets. In all runs we assume the number of clusters $k$ is given ($k = 10$ for MNIST and $k = 4$ for Reuters). As a reference, we also report the performance of k-means and (standard) spectral clustering. Specifically, we compare SpectralNet to DEC (Xie et al., 2016), DCN (Yang et al., 2017), VaDE (Zheng et al., 2016), JULE (Yang et al., 2016), DEPICT (Dizaji et al., 2017), and IMSAT (Hu et al., 2017). The results for these six methods are taken from the corresponding papers. Technical details regarding the application of k-means and spectral clustering appear in Appendix D.
We considered two variants of the Gaussian affinity function: using Euclidean distances, as in (6), and using Siamese distances; the latter case follows Algorithm 1. In all experiments we used the loss (3). In addition, we report results of SpectralNet (and the Siamese net) in both input space and code space. The code spaces are obtained using the publicly available autoencoders used to pretrain the weights of VaDE (https://github.com/slim1017/VaDE/tree/master/pretrain_weights), and are 10-dimensional. We refer the reader to Appendix D for technical details about the architectures and training procedures.
5.2.1 MNIST
MNIST is a collection of 70,000 grayscale images of handwritten digits, divided into training (60,000) and test (10,000) sets. To construct positive pairs for the Siamese net, we paired each instance with its two nearest neighbors. An equal number of negative pairs were chosen randomly from non-neighboring points.
Table 1 shows the performance of the various clustering algorithms on the MNIST dataset, using all 70,000 images for training. As can be seen, the performance of SpectralNet is significantly improved when using Siamese distance instead of Euclidean distance, and when the data is represented in code space rather than in pixel space. With these two components, SpectralNet outperforms DEC, DCN, VaDE, DEPICT and JULE, and is competitive with IMSAT.
Table 1: Clustering performance on MNIST and Reuters.

| Algorithm | ACC (MNIST) | NMI (MNIST) | ACC (Reuters) | NMI (Reuters) |
| k-means | .534 | .499 | .533 | .401 |
| Spectral clustering | .717 | .754 | NA | NA |
| DEC | .843 | .8 | .756 | not reported |
| DCN | .83 | .81 | not reported | not reported |
| VaDE | .9446 | not reported | .7938 | not reported |
| JULE | not reported | .913 | not reported | not reported |
| DEPICT | .965 | .917 | not reported | not reported |
| IMSAT | .984 ± .004 | not reported | .719 | not reported |
| SpectralNet (input space, Euclidean distance) | .622 ± .008 | .687 ± .004 | .645 ± .01 | .444 ± .01 |
| SpectralNet (input space, Siamese distance) | .826 ± .03 | .884 ± .02 | .661 ± .017 | .381 ± .057 |
| SpectralNet (code space, Euclidean distance) | .800 ± .003 | .814 ± .008 | .605 ± .053 | .401 ± .061 |
| SpectralNet (code space, Siamese distance) | .971 ± .001 | .924 ± .001 | .803 ± .006 | .532 ± .010 |
To evaluate how well the outputs of SpectralNet approximate the true eigenvectors of the graph Laplacian, we compute the Grassmann distance between the subspace spanned by the SpectralNet outputs and that spanned by the true eigenvectors. The squared Grassmann distance measures the sum of the squared sines of the principal angles between two $k$-dimensional subspaces; it takes values in $[0, k]$. Figure 2 shows the Grassmann distance on the MNIST dataset as a function of training time (expressed as the number of parameter updates). The distance decreases rapidly at the beginning of training and stabilizes around 0.026 as training progresses.
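The squared Grassmann distance can be computed from the principal angles between the two subspaces; a sketch (our naming):

```python
import numpy as np

def grassmann_distance_sq(U, V):
    """Sum of squared sines of the principal angles between the column
    spaces of U and V (both n x k)."""
    Qu, _ = np.linalg.qr(U)
    Qv, _ = np.linalg.qr(V)
    cosines = np.linalg.svd(Qu.T @ Qv, compute_uv=False)  # cos of the angles
    return float(np.sum(1.0 - np.clip(cosines, 0.0, 1.0) ** 2))
```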
To check the generalization ability of SpectralNet to new test points, we repeated the experiment, this time training SpectralNet only on the training set and predicting the labels of the test examples by passing them through the net and associating each test example with the nearest centroid obtained by k-means on the embeddings of the training examples. The accuracy on the test examples was .970, implying that SpectralNet generalizes well to unseen test data in this case. We similarly evaluated the generalization performance of k-means; its accuracy on the test set is .546 in input space and .776 in code space, both significantly inferior to SpectralNet.
5.2.2 Reuters
The Reuters dataset is a collection of English news stories, labeled by category. Like DEC and VaDE, we used the following categories as labels: corporate/industrial, government/social, markets, and economics, and discarded all documents labeled with multiple categories. Each article is represented by a tf-idf vector, using the 2000 most frequent words. The dataset contains approximately 685,000 documents; performing vanilla spectral clustering on a dataset of this size in a standard way is prohibitive. The AE used to map the data to code space was trained on a random subset of 10,000 samples from the full dataset. To construct positive pairs for the Siamese net, we randomly sampled 300,000 examples from the entire dataset and paired each one with a random neighbor from its 3000 nearest neighbors. An equal number of negative pairs was obtained by randomly pairing each point with one of the remaining points.
Table 1 shows the performance of the various algorithms on the Reuters dataset. Overall, we see behavior similar to what we observed on MNIST: SpectralNet outperforms all other methods, performing best in code space with Siamese affinity. Our SpectralNet implementation took less than 20 minutes to learn the spectral map on this dataset, using a GeForce GTX 1080 GPU. For comparison, computing the top four eigenvectors of the Laplacian matrix of the complete data, needed for spectral clustering, took over 100 minutes using ARPACK. Note that both SpectralNet and spectral clustering require a precomputed nearest-neighbor graph. Moreover, spectral clustering using the ARPACK eigenvectors failed to produce reasonable clustering. This illustrates the robustness of our method, in contrast to the well-known instability of spectral clustering to outliers.
To evaluate the generalization ability of SpectralNet, we randomly divided the data into a 90%/10% split, retrained the Siamese net and SpectralNet on the larger subset, and predicted the labels of the smaller subset. The test accuracy was .798, implying that, as on MNIST, SpectralNet generalizes well to new examples.
5.3 Semi-supervised learning
SpectralNet can be extended to also leverage labeled data points, when such points are available. This can be done by setting the affinity between labeled points according to their true labels, rather than using (6). Figure 3 shows a demonstration in 2D, where SpectralNet fails to recognize the true cluster structure, due to the large amount of noise in the data. However, using labels for randomly chosen 2% of the points allows SpectralNet to recognize the right cluster structure, despite the noise.
Figure 3: Illustrative 2D demo of semi-supervised learning using SpectralNet. Left: SpectralNet fails to recognize the true cluster structure, due to the heavy noise. Right: using labels for a randomly chosen 2% of the points, the true cluster structure is recognized.
We remark that the labeled points can also be utilized in additional ways, such as to enrich the training set of the Siamese net, by constructing positive and negative pairs of labeled points based on their labels, and by adding a cross-entropy term to the SpectralNet loss.
6 Conclusions
We have introduced SpectralNet, a deep learning approach for approximate spectral clustering. The stochastic training of SpectralNet allows us to scale to larger datasets than vanilla spectral clustering can handle, and the parametric map obtained from the net enables straightforward out-of-sample extension. In addition, we propose to use unsupervised Siamese networks to compute distances, and empirically show that this yields better performance than standard Euclidean distances. Further improvements are achieved by applying our network to code representations produced by a standard stacked autoencoder. We present a novel analysis of the VC dimension of spectral clustering and derive a lower bound on the size of neural nets that compute it. In addition, we report state-of-the-art results on two benchmark datasets, and show that SpectralNet outperforms existing methods when the clusters cannot be contained in non-overlapping convex shapes. We believe the integration of spectral clustering with deep learning provides a useful tool for unsupervised deep learning.
Acknowledgements
We thank Raphy Coifman and Sahand Negahban for helpful discussions.
References
 Alzate & Suykens (2010) Carlos Alzate and Johan AK Suykens. Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(2):335–347, 2010.
 Bengio et al. (2004) Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L Roux, and Marie Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems, pp. 177–184, 2004.
 Cai et al. (2011) Deng Cai, Xiaofei He, and Jiawei Han. Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering, 23(6):902–913, 2011.
 Chen (2015) Gang Chen. Deep learning with nonparametric clustering. arXiv preprint arXiv:1501.03084, 2015.
 Coifman & Lafon (2006a) Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006a.
 Coifman & Lafon (2006b) Ronald R Coifman and Stéphane Lafon. Geometric harmonics: a novel tool for multiscale outofsample extension of empirical functions. Applied and Computational Harmonic Analysis, 21(1):31–52, 2006b.
 Dilokthanakul et al. (2016) Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
 Dizaji et al. (2017) Kamran Ghasedi Dizaji, Amirhossein Herandi, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. arXiv preprint arXiv:1704.06327, 2017.
 Fowlkes et al. (2004) Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.
 Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, pp. 1735–1742. IEEE, 2006.
 Han & Filippone (2017) Yufei Han and Maurizio Filippone. Minibatch spectral clustering. In Neural Networks (IJCNN), 2017 International Joint Conference on, pp. 3888–3895. IEEE, 2017.
 Hershey et al. (2016) John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 31–35. IEEE, 2016.
 Hu et al. (2017) Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. arXiv preprint arXiv:1702.08720, 2017.

 Law et al. (2017) Marc T Law, Raquel Urtasun, and Richard S Zemel. Deep spectral clustering learning. In International Conference on Machine Learning, pp. 1985–1994, 2017.
 Mishne et al. (2017) Gal Mishne, Uri Shaham, Alexander Cloninger, and Israel Cohen. Diffusion nets. Applied and Computational Harmonic Analysis, 2017.
 Munkres (1957) James Munkres. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics, 5(1):32–38, 1957.
 Nadler et al. (2006) Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, and Ronald R Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Advances in Neural Information Processing Systems, pp. 955–962, 2006.

 Ng et al. (2002) Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856, 2002.
 Shaham & Lederman (2018) Uri Shaham and Roy R Lederman. Learning by coincidence: Siamese networks and common variable learning. Pattern Recognition, 74:52–63, 2018.
 Shalev-Shwartz & Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
 Shamir (2015) Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 144–152, 2015.
 Shi & Malik (2000) Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
 Tian et al. (2014) Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and TieYan Liu. Learning deep representations for graph clustering. In AAAI, pp. 1293–1299, 2014.
 Von Luxburg (2007) Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
 Xie et al. (2016) Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (ICML), 2016.
 Yang et al. (2017) Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In International Conference on Machine Learning (ICML), 2017.
 Yang et al. (2016) Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156, 2016.
 Yi et al. (2016) Li Yi, Hao Su, Xingwen Guo, and Leonidas Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. arXiv preprint arXiv:1612.00606, 2016.
 Zheng et al. (2016) Yin Zheng, Huachun Tan, Bangsheng Tang, Hanning Zhou, et al. Variational deep embedding: A generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.
Appendix A Illustrative datasets
To compare SpectralNet to spectral clustering, we consider a simple dataset of 1500 points in two dimensions, containing two nested 'C' shapes. We applied spectral clustering to the dataset by computing the eigenvectors of the unnormalized graph Laplacian corresponding to the two smallest eigenvalues, and then applying k-means (with $k = 2$) to these eigenvector embeddings. The affinity matrix was computed as in (6), where the scale $\sigma$ was set to the median distance between a point and its 3rd-nearest neighbor, a standard practice in diffusion applications.
Figure 4 shows the clustering of the data obtained by SpectralNet, standard spectral clustering, and k-means. Both SpectralNet and spectral clustering identify the correct cluster structure, while k-means fails to do so. Moreover, despite the stochastic training, the net outputs closely approximate the two eigenvectors of the graph Laplacian with smallest eigenvalues; indeed, the Grassmann distance between the net outputs and the true eigenvectors approaches zero as the loss decreases.
In the next experiment, we trained DCN, VaDE, DEPICT (using agglomerative clustering initialization), and IMSAT (using adversarial perturbations for data augmentation) on the 2D datasets of Figure 1. The experiments were performed using the code published by the authors of each paper. For each method, we tested various network architectures and hyperparameter settings. Unfortunately, for DCN, VaDE, and DEPICT we were unable to find a setting that yields an appropriate clustering on any of the datasets. IMSAT worked on two out of the five datasets, but failed to yield an appropriate clustering in fairly simple cases. Plots with typical results of each method on each of the five 2D datasets are shown in Figure 5.
To further investigate why these methods fail, we performed a sequence of experiments with the two nested 'C's data, varying the distance between the two clusters. The results are shown in Figure 6. We can see that all three methods fail to cluster the points correctly once the clusters cannot be separated using non-overlapping convex shapes.
Interestingly, although the target distribution of DEPICT was initialized with agglomerative clustering, which successfully clusters the nested 'C's, its target distribution becomes corrupted during training, even though its loss is considerably reduced; see Figure 7.
Appendix B Correctness of the decomposition
We next verify that the Cholesky decomposition can indeed be used to compute the QR decomposition of a matrix with full-rank Gram matrix. Let $A$ be such a matrix, let $A^T A = L L^T$ be its Cholesky decomposition (with $L$ lower triangular), and set $Q = A \left( L^{-1} \right)^T$ and $R = L^T$. First, observe that since $L$ is lower triangular, so is $L^{-1}$, and hence $\left( L^{-1} \right)^T$ is upper triangular. Therefore, for every $j$, the column space of the first $j$ columns of $Q$ is the same as the column space of the first $j$ columns of $A$. To show that the columns of $Q$ correspond to a Gram-Schmidt orthogonalization of the columns of $A$, it therefore remains to show that $Q^T Q = I$. Indeed:
$$Q^T Q = L^{-1} A^T A \left( L^{-1} \right)^T = L^{-1} L L^T \left( L^{-1} \right)^T = I.$$
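This identity is easy to check numerically; a small self-contained sanity check (our own test, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 10))            # tall, full rank w.h.p.
L = np.linalg.cholesky(A.T @ A)
Q = A @ np.linalg.inv(L).T
assert np.allclose(Q.T @ Q, np.eye(10), atol=1e-8)  # orthonormal columns
assert np.allclose(Q @ L.T, A)                       # A = Q R with R = L^T
```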
Appendix C Section 4 proofs
C.1 Preliminaries
To prove Theorem 4.1, we begin with the following definition and lemmas.
Definition C.1 ($(\delta, \epsilon)$-separated graph).
Let $\delta > \epsilon > 0$. A $(\delta, \epsilon)$-separated graph is a weighted graph $G = (V, W)$, where $V$ has an even number of vertices and a balanced partition $V = A \cup B$ with $|A| = |B|$, and $W$ is an affinity matrix such that:

1. For any $v, v' \in A$ (resp. $B$), there is a path $v = v_1, v_2, \ldots, v_r = v'$ within $A$ (resp. $B$) so that for every two consecutive points along the path, $W_{v_t, v_{t+1}} \ge \delta$.

2. For any $a \in A$ and $b \in B$, $W_{a,b} \le \epsilon$.
Lemma C.2.
For any integer $m$ there exists a set $C \subset \mathbb{R}^3$ with $|C| = m$, so that for any binary partition $C = C_A \cup C_B$, we can construct a set of at most $10m$ points, $V \supseteq C$, and a balanced binary partition $V = A \cup B$ of it, such that:

1. $C_A \subseteq A$ and $C_B \subseteq B$.

2. For any $v, v' \in A$ (resp. $B$), there is a path $v = v_1, \ldots, v_r = v'$ within $A$ (resp. $B$) so that for every two consecutive points along the path, $\|v_t - v_{t+1}\| < 1$ (property a).

3. For any $a \in A$ and $b \in B$, $\|a - b\| \ge 1$ (property b).
Proof.
We prove this for the case $d = 3$; the proof holds for any $d \ge 3$.
Let $m$ be an integer. We choose the set $C$ to lie on a 2-dimensional unit grid inside a square of minimal diameter, placed in the $z = 0$ plane, so that each point is at distance 1 from its nearest neighbors.
Next, given a partition of $C$ into two subsets $C_A$ and $C_B$, we construct a set $V$ of at most $10m$ points and a partition $V = A \cup B$ that satisfy the conditions of the lemma (an illustration can be seen in Figure 8). First, we add points so as to obtain a balanced partition: we add new grid points in the $z = 0$ plane, assigning each of them arbitrarily to either $A$ or $B$ until $|A| = |B|$, keeping all points inside a square of minimal diameter. We further add all the points in $C_A$ to $A$ and those in $C_B$ to $B$.
In the next step, we prepare a copy $A'$ of the points of $A$ (with the same $x, y$ coordinates) in the $z = 1$ plane, and a copy $B'$ of the points of $B$ in the $z = -1$ plane. We then add more points so as to make the full set satisfy properties a and b. First, we add the midpoint between every point and its copy; we assign each such midpoint to $A$ if it is placed between a point of $A$ and its copy in $A'$, and to $B$ if it is placed between a point of $B$ and its copy in $B'$. Then we connect the points of $A'$ (resp. $B'$) by a minimal-length spanning tree and add points equally spaced along the edges of these two spanning trees, densely enough that consecutive points are less than distance 1 apart. We assign the new points on the spanning tree of $A'$ to $A$ and those on the spanning tree of $B'$ to $B$.
We argue that the obtained point set, of size at most $10m$, satisfies the conditions of the lemma. Clearly, $C_A \subseteq A$ and $C_B \subseteq B$. To see that property a is satisfied, note that the length of each spanning tree is at most linear in $m$ (the full grid can be connected by a tree whose length is linear in $m$), so the number of added points remains within the stated bound, and every two points $v, v' \in A$ (resp. $B$) are connected by a path in which the distance between every two consecutive points is strictly less than 1 (property a). Property b too is satisfied: the grid points in the $z = 0$ plane are at least distance 1 apart; each midpoint is at distance $\frac{1}{2}$ from the point below or above it (all of which belong to the same set, either $A$ or $B$), while its distance to the rest of the points exceeds 1; and the remaining points of $A$ (resp. $B$) lie in the $z = 1$ (resp. $z = -1$) plane, and so are at least distance 1 away from all members of the opposite set, which lie in the half space $z \le 0$ (resp. $z \ge 0$). ∎
Lemma C.3.
Let $L$ be the spectral clustering loss
$$L(y) = \sum_{i,j=1}^{n} W_{i,j} \left( y_i - y_j \right)^2.$$
Let $G = (V, W)$ be a $(\delta, \epsilon)$-separated graph with balanced partition $V = A \cup B$, such that $|V| = n$. Let $y^* \in \mathbb{R}^n$ be a minimizer of $L$ w.r.t. $y$, subject to $\sum_i y_i = 0$ and $\sum_i y_i^2 = n$. Let
$$\Delta_A = \max_{i, j \in A} \left| y^*_i - y^*_j \right|,$$
and similarly
$$\Delta_B = \max_{i, j \in B} \left| y^*_i - y^*_j \right|.$$
Let $\Delta = \max\{\Delta_A, \Delta_B\}$. Then
$$\Delta^2 \le \frac{2 \epsilon n^3}{\delta}.$$
Proof.
Without loss of generality, assume that $\Delta = \Delta_A$, and let $a_{\min}, a_{\max} \in A$ be points at which the minimum and maximum of $y^*$ over $A$ are attained, so that $y^*_{a_{\max}} - y^*_{a_{\min}} = \Delta_A$. We begin by lower-bounding $L(y^*)$.
Since $G$ is $(\delta, \epsilon)$-separated, there exists a path from $a_{\min}$ to $a_{\max}$ where the affinity of every pair of consecutive points is at least $\delta$. Denote this path by $a_{\min} = v_1, v_2, \ldots, v_r = a_{\max}$ (with $r \le n$); therefore
$$L(y^*) \ge \delta \sum_{t=1}^{r-1} \left( y^*_{v_t} - y^*_{v_{t+1}} \right)^2.$$
Note that this is a telescoping sum of squares: the increments $y^*_{v_{t+1}} - y^*_{v_t}$ sum to $\Delta_A$. Such a sum of squares is minimized if all points are ordered and equidistant, i.e., if we divide a segment of length $\Delta_A$ into $r - 1$ segments of equal length. Consequently,
$$L(y^*) \ge \delta (r - 1) \left( \frac{\Delta_A}{r - 1} \right)^2 = \frac{\delta \Delta_A^2}{r - 1} \ge \frac{\delta \Delta^2}{n}.$$
Next, to produce an upper bound, we consider the test vector $\tilde{y}$ with $\tilde{y}_i = 1$ for $i \in A$ and $\tilde{y}_i = -1$ for $i \in B$; note that $\tilde{y}$ satisfies the constraints. For this vector, only the cross pairs contribute, each with affinity at most $\epsilon$ and squared difference $4$, and there are fewer than $\frac{n^2}{2}$ ordered cross pairs; hence
$$L(y^*) \le L(\tilde{y}) \le 2 \epsilon n^2.$$
In summary, we obtain
$$\frac{\delta \Delta^2}{n} \le L(y^*) \le 2 \epsilon n^2,$$
hence
$$\Delta^2 \le \frac{2 \epsilon n^3}{\delta}. \qquad \blacksquare$$
Lemma C.4.
Let $y \in \mathbb{R}^n$ be a vector such that $\sum_i y_i = 0$ and $\sum_i y_i^2 = n$, and let $V = A \cup B$ be a balanced partition of the indices. Let
$$\Delta_A = \max_{i, j \in A} |y_i - y_j|,$$
and similarly
$$\Delta_B = \max_{i, j \in B} |y_i - y_j|.$$
Let $\Delta = \max\{\Delta_A, \Delta_B\}$. If $\Delta < \frac{1}{2}$, then thresholding $y$ at 0 respects the partition $(A, B)$; i.e., the entries of $y$ are positive on one of the sets and negative on the other.
Proof.
Let $i^* = \arg\max_i |y_i|$. Since $\sum_i y_i^2 = n$, we have $|y_{i^*}| \ge 1$. Without loss of generality, assume that $i^* \in A$ and $y_{i^*} \ge 1$ (otherwise replace $y$ by $-y$ and/or exchange the roles of $A$ and $B$). For every $i \in A$,
$$y_i \ge y_{i^*} - \Delta \ge 1 - \Delta. \qquad (7)$$
Summing (7) over $A$ and using $\sum_i y_i = 0$, we get $\sum_{j \in B} y_j \le -\frac{n}{2}(1 - \Delta)$, hence some $j^* \in B$ satisfies $y_{j^*} \le -(1 - \Delta)$. Consequently, for every $j \in B$,
$$y_j \le y_{j^*} + \Delta \le -1 + 2\Delta.$$
In order to obtain the desired result, i.e., that thresholding $y$ at 0 respects the partition, it therefore remains to show that for a sufficiently small $\Delta$ the entries of $y$ are positive on $A$ and negative on $B$. By (7), the former holds whenever $\Delta < 1$; for the latter we require
$$-1 + 2\Delta < 0,$$
which holds for $\Delta < \frac{1}{2}$. ∎
C.2 Proof of Theorem 4.1
Proof.
To lower-bound the VC dimension of $\mathcal{H}_n$, we show that there exists a set of $\frac{n}{10}$ points (assuming for simplicity that $n$ is divisible by 10) that is shattered by spectral clustering. By Lemma C.2, there exists a set $C$ of $\frac{n}{10}$ points so that for any dichotomy of $C$ there exists a set $V \supseteq C$ of $n$ points, with a balanced partition $(A, B)$ that respects the dichotomy of $C$, and whose points satisfy properties a and b of Lemma C.2.
Consider next the complete graph whose vertices correspond to the points of $V$, with the standard Gaussian affinity $W_{i,j} = \exp\left( -\frac{\|v_i - v_j\|^2}{2\sigma^2} \right)$, where the value of $\sigma$ will be chosen below. It can readily be verified that, due to properties a and b, this graph is $(\delta, \epsilon)$-separated, with $\delta = \exp\left( -\frac{\rho^2}{2\sigma^2} \right)$ and $\epsilon = \exp\left( -\frac{1}{2\sigma^2} \right)$, where $\rho < 1$ denotes the maximal distance between consecutive points along the paths guaranteed by property a.
Let $y^*$ be the second-smallest eigenvector of the graph Laplacian matrix for $V$, i.e., the minimizer of $L(y) = \sum_{i,j} W_{i,j}(y_i - y_j)^2$ subject to $\sum_i y_i = 0$ and $\sum_i y_i^2 = n$. By Lemma C.3, since the graph is $(\delta, \epsilon)$-separated, the spread $\Delta$ of the entries of $y^*$ within $A$ and within $B$ satisfies
$$\Delta^2 \le \frac{2 \epsilon n^3}{\delta}.$$
Notice that
$$\frac{\epsilon}{\delta} = \exp\left( -\frac{1 - \rho^2}{2\sigma^2} \right),$$
allowing us to make $\Delta$ arbitrarily small by pushing the scale $\sigma$ towards 0.³ In particular, we can set $\sigma$ so as to make $\Delta$ satisfy $\Delta < \frac{1}{2}$. Therefore, by Lemma C.4, thresholding $y^*$ at 0 respects the partition $(A, B)$, and hence also the dichotomy of $C$.
In summary, we have shown that any dichotomy of $C$ can be obtained from a second-smallest eigenvector of some graph Laplacian on $n$ points. Hence the VC dimension of $\mathcal{H}_n$ is at least $\frac{n}{10}$. ∎
³We note that Theorem 4.1 also holds with constant $\sigma$, in which case we can instead uniformly scale the locations of the points of $A$ and $B$, respectively.
Appendix D Technical details
Table 2: Architectures of the Siamese net and SpectralNet.

| Dataset | Siamese net | SpectralNet |
| MNIST | ReLU, size = 1024 | ReLU, size = 1024 |
| | ReLU, size = 1024 | ReLU, size = 1024 |
| | ReLU, size = 512 | ReLU, size = 512 |
| | ReLU, size = 10 | tanh, size = 10 |
| | — | orthonorm |
| Reuters | ReLU, size = 512 | ReLU, size = 512 |
| | ReLU, size = 256 | ReLU, size = 256 |
| | ReLU, size = 128 | tanh, size = 4 |
| | — | orthonorm |
Table 3: Training details.

| | MNIST Siamese | MNIST SpectralNet | Reuters Siamese | Reuters SpectralNet |
| Batch size | 128 | 1024 | 128 | 2048 |
| Ortho. batch size | — | 1024 | — | 2048 |
| Initial LR | | | | |
| LR decay | .1 | .1 | .1 | .1 |
| Optimizer | RMSprop | RMSprop | RMSprop | RMSprop |
| Patience epochs | 10 | 10 | 10 | 10 |
For k-means we used Python's sklearn.cluster with the default configuration (in particular, 300 iterations of the algorithm and 10 restarts from different centroid seeds; the final results are from the run with the best objective). To perform spectral clustering, we computed an affinity matrix $W$ using (6), with the number of neighbors set to 25 and the scale $\sigma$ set to the median distance from each point to its 25th neighbor. Once $W$ was computed, we took the eigenvectors of $D - W$ corresponding to the $k$ smallest eigenvalues, and then applied k-means to that embedding; the k-means configuration was as above. In our experiments, the loss (3) was computed with a factor of $\frac{1}{m}$ rather than $\frac{1}{m^2}$, for numerical stability. The architectures of the Siamese net and SpectralNet are described in Table 2. Additional training details are shown in Table 3.
The learning rate policy for all nets was determined by monitoring the loss on a validation set (a random subset of the training set): once the validation loss did not improve for a specified number of epochs (see "patience epochs" in Table 3), we divided the learning rate by 10 (see "LR decay" in Table 3). Training stopped once the learning rate reached a preset minimal value. Typical training took about 100 epochs for a Siamese net and fewer than 20,000 parameter updates for SpectralNet, on both MNIST and Reuters.
In the MNIST experiments, the training set for the Siamese net was obtained by pairing each data point with its two nearest neighbors (in Euclidean distance). During the training of the spectral map, we construct the batch affinity matrix by connecting each point to its two nearest neighbors in the Siamese distance. The scale $\sigma$ in (6) was set to the median of the distances from each point to its nearest neighbor.
In the Reuters experiment, we obtained the training set for the Siamese net by pairing each point with a random point among its 100 nearest neighbors, found by an approximate nearest neighbor algorithm (https://github.com/spotify/annoy). To evaluate the generalization performance, the Siamese nets were trained using training data only. The scale $\sigma$ in (6) was set globally to the median (across all points in the dataset) distance from a point to its 10th neighbor.
Finally, we used the validation loss to determine the hyperparameters. To demonstrate that the validation loss is indeed correlated with clustering accuracy, we conducted a series of experiments on the MNIST dataset in which we varied the net architectures and learning rate policies; the Siamese net and the Gaussian scale parameter were held fixed throughout. In each experiment, we measured the loss on a validation set and the clustering accuracy (over the entire data). The correlation between loss and accuracy across these experiments was −0.771. This implies that the hyperparameter setting for learning the spectral map can be chosen based on the validation loss, and that a setup yielding a smaller validation loss should be preferred. We remark that we also use the convergence of the validation loss to determine our learning rate schedule and stopping criterion.