Clustering unlabeled data is an important problem from both a scientific and practical perspective. As technology plays a larger role in daily life, the volume of available data has exploded. However, labeling this data remains very costly and often requires domain expertise. Therefore, unsupervised clustering methods are one of the few viable approaches to gain insight into the structure of these massive unlabeled datasets.
One of the most popular clustering methods is spectral clustering [20, 17, 21], which first embeds the similarity of each pair of data points in the Laplacian's eigenspace and then uses k-means to generate clusters from it. Spectral clustering not only outperforms commonly used clustering methods, such as k-means, but also allows us to directly minimize the pairwise distance between data points and solve for the optimal node embeddings analytically. Moreover, it has been shown that the eigenvectors of the normalized Laplacian matrix can be used to find an approximate solution to the well-known normalized cuts problem [17, 21].
In this work, we introduce CNC, a framework for Clustering by learning to optimize expected Normalized Cuts. We show that by directly minimizing a continuous relaxation of the normalized cuts problem, CNC enables an end-to-end learning approach that outperforms top-performing clustering approaches. We demonstrate that our approach produces lower normalized cut values than baseline methods such as SpectralNet, which in turn results in better clustering accuracy.
Let us motivate CNC through a simple example. In Figure 1, we want to cluster 6 images from the CIFAR-10 dataset into two clusters. The affinity graph for these data points is shown in Figure 1(a) (details of constructing such a graph are discussed in Section 4.2). In this example, it is obvious that the optimal clustering is the result of cutting the edge connecting the two triangles. Cutting this edge results in the optimal value for the normalized cuts objective. In CNC, we define a new differentiable loss function equivalent to the expected normalized cuts objective. We train a deep learning model to minimize the proposed loss in an unsupervised manner, without the need for any labeled datasets. Our trained model directly returns the probabilities of belonging to each cluster (Figure 1(b)). In this example, the optimal normalized cuts value is 0.286 (Equation 1), and the CNC loss also converges to this value (Figure 1(c)).
We compare the performance of CNC to several learning-based clustering approaches (SpectralNet [19], DEC [22], DCN [23], VaDE [12], DEPICT [8], IMSAT [10], and IIC [11]) on four datasets: MNIST, Reuters, CIFAR-10, and CIFAR-100. Our results show up to 10.9% improvement over the baselines. Moreover, generalizing spectral embeddings to unseen data points, a task commonly referred to as out-of-sample extension (OOSE), is non-trivial [3, 1, 15]. Our results confirm that CNC generalizes to unseen data. Our generalization results are superior (by up to 21.9%) to SpectralNet [19], the recent top-performing clustering approach with the ability to generalize.
2 Related Work
Recent deep learning approaches to clustering attempt to embed the input data into a form that is amenable to clustering by k-means or Gaussian Mixture Models. [23, 22] focused on learning representations for clustering. To find clustering-friendly latent representations and to better cluster the data, DCN [23] proposed a joint dimensionality reduction (DR) and k-means clustering approach in which DR is accomplished via learning a deep neural network. DEC [22] simultaneously learns cluster assignments and the underlying feature representation by iteratively updating a target distribution to sharpen cluster associations.
Several other approaches rely on a variational autoencoder that utilizes a Gaussian mixture prior [12, 7, 10, 11, 2]. These approaches are mainly based on data augmentation, where the network is trained to maximize the mutual information between inputs and predicted clusters, while regularizing the network so that the cluster assignment of the data points is consistent with the assignment of the augmented points.
One of the most effective techniques is spectral clustering, which first generates node embeddings in the eigenspace of the graph Laplacian, and then applies k-means clustering to these vectors [20, 17, 21]. To address the fact that clusters with the lowest graph conductance tend to have few nodes [13, 24], regularized spectral clustering has been proposed to encourage more balanced clusters.
Generalizing clustering to unseen nodes and graphs is nontrivial [3, 1, 15]. A recent work, SpectralNet [19], takes a deep learning approach to spectral clustering that generalizes to unseen data points. This approach first learns embeddings of the similarity of each pair of data points in the Laplacian's eigenspace and then applies k-means to those embeddings to generate clusters. Unlike SpectralNet, we propose an end-to-end learning approach with a differentiable loss that directly minimizes the normalized cuts. We show that our approach produces lower normalized cut values than baseline methods such as SpectralNet, which in turn results in better clustering accuracy. Our evaluation results show that CNC improves generalization accuracy on unseen data points by up to 21.9%.
Since the CNC objective is based on optimizing normalized cuts, in this section we briefly review the formal definition of this metric.
3.1 Formal definition of Normalized cuts
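Let $G = (V, E, W)$ be a graph where $V$ and $E$ are the sets of nodes and edges in the graph and $w_{ij} \in W$ is the weight of the edge between nodes $v_i$ and $v_j$. Let $n$ be the number of nodes. A graph $G$ can be clustered into $g$ disjoint sets $S_1, S_2, \dots, S_g$, where the union of the nodes in those sets is $V$ ($\bigcup_{k=1}^{g} S_k = V$), and each node belongs to only one set ($S_k \cap S_l = \emptyset$ for $k \neq l$), by simply removing the edges connecting those sets. For example, in Figure 1(a), by removing one edge two disjoint clusters are formed. The normalized cuts objective is:
$$\mathrm{Ncuts}(S_1, S_2, \dots, S_g) = \sum_{k=1}^{g} \frac{\mathrm{cut}(S_k, \bar{S}_k)}{\mathrm{vol}(S_k, V)} \tag{1}$$
where $\bar{S}_k$ represents the complement of $S_k$, i.e., $\bar{S}_k = \bigcup_{l \neq k} S_l$. $\mathrm{cut}(S_k, \bar{S}_k)$ is called the cut and is the total weight of the edges that are removed from $G$ in order to form the disjoint sets $S_k$ and $\bar{S}_k$. $\mathrm{vol}(S_k, V)$ is the total weight of the edges ($w_{ij}$) whose end points ($v_i$ or $v_j$) belong to $S_k$. The cut and vol are:
$$\mathrm{cut}(S_k, \bar{S}_k) = \sum_{v_i \in S_k,\, v_j \in \bar{S}_k} w_{ij}, \qquad \mathrm{vol}(S_k, V) = \sum_{v_i \in S_k,\, v_j \in V} w_{ij} \tag{2}$$
Note that in the cut of Equation 2, $S_k$ and $\bar{S}_k$ are disjoint, i.e., $S_k \cap \bar{S}_k = \emptyset$, while in vol, $S_k \subset V$. In the running example (Figure 1), since the edge weights are one, $\mathrm{cut}(S_1, \bar{S}_1) = \mathrm{cut}(S_2, \bar{S}_2) = 1$, and $\mathrm{vol}(S_1, V) = \mathrm{vol}(S_2, V) = 2 \times 3 + 1 = 7$. Thus $\mathrm{Ncuts}(S_1, S_2) = \frac{1}{7} + \frac{1}{7} = 0.286$. In this example, one can see that this clustering yields the minimum value of the normalized cuts. CNC aims to find a cut that minimizes the normalized cuts (Equation 1).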
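The arithmetic of the running example can be checked numerically. The sketch below is only an illustration, not the authors' code: it assumes the six nodes are indexed 0 to 5, with the two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3), and the helper name `normalized_cut` is hypothetical.

```python
import numpy as np

# Assumed layout of the running example: two triangles {0,1,2} and {3,4,5}
# joined by the single edge (2, 3); all edge weights are one.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0


def normalized_cut(W, labels):
    """Ncuts for a hard assignment: sum_k cut(S_k, complement) / vol(S_k, V)."""
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        in_k = labels == k
        cut = W[in_k][:, ~in_k].sum()  # weight of edges leaving S_k
        vol = W[in_k].sum()            # weight of all edges with an end point in S_k
        total += cut / vol
    return total


ncut_best = normalized_cut(W, [0, 0, 0, 1, 1, 1])   # cut the bridge edge
ncut_other = normalized_cut(W, [0, 0, 1, 1, 1, 1])  # cut through a triangle
print(round(ncut_best, 3))  # 0.286
```

Cutting through a triangle instead of the bridge removes two edges and leaves an unbalanced split, so its normalized cut is larger than the value obtained by cutting the bridge.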
4 CNC Framework
Finding the cluster assignments that minimize the normalized cuts is NP-complete, and an approximation to this problem based on the eigenvectors of the normalized graph Laplacian has been studied in [20, 24]. CNC, on the other hand, is a neural network framework for learning to cluster in the absence of labeled examples by directly minimizing the continuous relaxation of the normalized cuts. As shown in Algorithm 1, end-to-end training of CNC contains two steps, i.e., (i) data point embedding (line 3), and (ii) clustering (lines 4-9). In the embedding step, the goal is to learn embeddings that capture the affinity of the data points, while the clustering step uses those embeddings to learn the CNC model and outputs the cluster assignments. Next, we first focus on the clustering step and introduce our new differentiable loss function to train the CNC model. Later, in Section 4.2, we discuss the details of the embedding step.
4.1 Clustering Step: Learn CNC model
In this section, we describe the clustering step in Algorithm 1 (lines 4-9). For each data point $x_i$, the input to the clustering step is its embedding $v_i \in \mathbb{R}^d$ (details in Section 4.2). The goal is to learn a CNC model $F_\theta: \mathbb{R}^d \rightarrow \mathbb{R}^g$ that, for a given embedding $v_i$, returns $y_i = F_\theta(v_i)$, which represents the assignment probabilities over $g$ clusters. For $n$ data points, it returns $Y \in \mathbb{R}^{n \times g}$, where $Y_{ik}$ represents the probability that $v_i$ belongs to cluster $S_k$. The CNC model $F_\theta$ is implemented using a neural network, where the parameter vector $\theta$ denotes the network weights. We propose a loss function based on the output $Y$ to calculate the expected normalized cuts. CNC thus learns $F_\theta$ by minimizing this loss.
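Recall that $\mathrm{cut}(S_k, \bar{S}_k)$ is the total weight of the edges that are removed from $G$ in order to form the disjoint sets $S_k$ and $\bar{S}_k$. In our setup, embeddings are the nodes in graph $G$, and the neighbors of an embedding $v_i$ are based on the k-nearest neighbors. Let $Y_{ik}$ be the probability that node $v_i$ belongs to cluster $S_k$. The probability that node $v_j$ does not belong to $S_k$ is then $1 - Y_{jk}$. Therefore, $\mathbb{E}[\mathrm{cut}(S_k, \bar{S}_k)]$ can be formulated by Equation 3, where $\mathcal{N}(v_i)$ is the set of nodes adjacent to $v_i$.
$$\mathbb{E}[\mathrm{cut}(S_k, \bar{S}_k)] = \sum_{v_i \in V} \sum_{v_j \in \mathcal{N}(v_i)} w_{ij}\, Y_{ik} (1 - Y_{jk}) \tag{3}$$
Since the weight matrix $W$ represents the edge weights between adjacent nodes, we can rewrite Equation 3:
$$\mathbb{E}[\mathrm{cut}(S_k, \bar{S}_k)] = \text{reduce-sum}\left( Y_{:,k} (1 - Y_{:,k})^{\top} \odot W \right) \tag{4}$$
The element-wise product with the weight matrix ($\odot W$) ensures that only the adjacent nodes are considered. Moreover, the result of $Y_{:,k} (1 - Y_{:,k})^{\top} \odot W$ is an $n \times n$ matrix and reduce-sum is the sum over all of its elements.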
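To make the step from the neighbor sum (Equation 3) to the matrix form (Equation 4) concrete, the following sketch evaluates both on a small random graph and checks that they agree. The toy graph, the softmax-generated `Y`, and the function names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, g = 6, 2

# Hypothetical toy graph: symmetric 0/1 affinity matrix with zero diagonal.
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1).astype(float)
W = upper + upper.T

# Soft assignments: each row of Y is a probability distribution over g clusters.
logits = rng.normal(size=(n, g))
Y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)


def expected_cut_loops(Y, W, k):
    """Explicit sum over every node v_i and its neighbors v_j (Equation 3)."""
    total = 0.0
    for i in range(len(W)):
        for j in np.flatnonzero(W[i]):  # N(v_i): nodes adjacent to v_i
            total += W[i, j] * Y[i, k] * (1.0 - Y[j, k])
    return total


def expected_cut_matrix(Y, W, k):
    """Outer product masked by W, then summed over all entries (Equation 4)."""
    return float((np.outer(Y[:, k], 1.0 - Y[:, k]) * W).sum())


gaps = [abs(expected_cut_loops(Y, W, k) - expected_cut_matrix(Y, W, k)) for k in range(g)]
```

The two forms differ only in how the sum is organized: entries of the outer product that correspond to non-adjacent pairs are multiplied by a zero weight and drop out.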
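From Equation 2, $\mathrm{vol}(S_k, V)$ is the total weight of the edges ($w_{ij}$) whose end points ($v_i$ or $v_j$) belong to $S_k$. Let $D$ be a column vector of size $n$ where $D_i$ is the total weight of the edges from node $v_i$. Given $Y$, we can calculate $\mathbb{E}[\mathrm{vol}(S_k, V)]$ as follows:
$$\Gamma = Y^{\top} D, \qquad \mathbb{E}[\mathrm{vol}(S_k, V)] = \Gamma_k \tag{5}$$
where $\Gamma$ is a vector in $\mathbb{R}^g$, and $g$ is the number of sets/clusters. With $\mathbb{E}[\mathrm{cut}(S_k, \bar{S}_k)]$ and $\mathbb{E}[\mathrm{vol}(S_k, V)]$ from Equations 4 and 5, the expected normalized cuts is:
$$\mathbb{E}[\mathrm{Ncuts}(S_1, S_2, \dots, S_g)] = \sum_{k=1}^{g} \frac{\text{reduce-sum}\left( Y_{:,k} (1 - Y_{:,k})^{\top} \odot W \right)}{\Gamma_k} = \text{reduce-sum}\left( (Y \oslash \Gamma)(1 - Y)^{\top} \odot W \right) \tag{6}$$
$\oslash$ is element-wise division (column $k$ of $Y$ is divided by $\Gamma_k$) and the result of $(Y \oslash \Gamma)(1 - Y)^{\top} \odot W$ is an $n \times n$ matrix, where reduce-sum is the sum over all of its elements.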
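A minimal NumPy sketch of the expected normalized cuts loss described above may help fix the shapes involved. It is not the authors' implementation; it assumes `Y` is an n-by-g matrix of assignment probabilities, `W` is a symmetric n-by-n affinity matrix, and the loss has the form reduce-sum((Y ⊘ Γ)(1 − Y)ᵀ ⊙ W) with Γ = YᵀD. The two-triangle graph mirrors the running example.

```python
import numpy as np


def expected_normalized_cut(Y, W):
    """reduce-sum((Y / Gamma) (1 - Y)^T * W), with Gamma = Y^T D."""
    D = W.sum(axis=1)                     # D_i: total edge weight at node v_i
    gamma = Y.T @ D                       # Gamma_k: expected volume of cluster k
    scaled = Y / gamma                    # broadcasting divides column k by Gamma_k
    return float(((scaled @ (1.0 - Y).T) * W).sum())


# Running example (assumed node order): triangles {0,1,2} and {3,4,5}, bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

Y_hard = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)  # one triangle per cluster
Y_uniform = np.full((6, 2), 0.5)                         # maximally uncertain model

loss_hard = expected_normalized_cut(Y_hard, W)
loss_uniform = expected_normalized_cut(Y_uniform, W)
print(round(loss_hard, 3))  # 0.286
```

With one-hot assignments the expectation collapses to the ordinary normalized cut of the running example, while the uniform assignment gives a larger value, so gradient descent on this loss has a reason to move away from uninformative outputs.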
As Equation 6 shows, the affinity graph is part of the CNC loss. When the number of data points ($n$) is large, this calculation can be expensive. However, our experimental results show that for large datasets (e.g., Reuters contains 685,071 documents), it is possible to optimize the loss on randomly sampled minibatches of data. We also build the affinity graph over a given minibatch using the embeddings, based on their k-nearest neighbors (Algorithm 1 (lines 5-6)). Specifically, in our implementation, the CNC model is a fully connected layer followed by Gumbel softmax, trained on randomly sampled minibatches of data to minimize Equation 6. When training is over, the final assignment of a node to a cluster is the argmax of its assignment probabilities (Algorithm 1 (line 9)).
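The minibatch procedure can be sketched as follows. This is a hedged illustration rather than the paper's code: the binary (0/1) edge weights, the symmetrization by taking the union of neighbor relations, and the function names `knn_affinity` and `gumbel_softmax` are all assumptions, and random logits stand in for the output of the trained network.

```python
import numpy as np


def knn_affinity(V, k):
    """Symmetric 0/1 affinity matrix linking each row of V to its k nearest rows."""
    sq = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    np.fill_diagonal(sq, np.inf)                               # a point is not its own neighbor
    nn = np.argsort(sq, axis=1)[:, :k]                         # indices of the k nearest points
    W = np.zeros(sq.shape)
    W[np.repeat(np.arange(len(V)), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)                                  # keep an edge if either side selects it


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax of (logits + Gumbel noise) / temperature."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    z = (logits - np.log(-np.log(u))) / temperature
    z = z - z.max(axis=1, keepdims=True)                       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
# Hypothetical minibatch of 8 two-dimensional embeddings forming two tight groups.
V = np.vstack([rng.normal(0.0, 0.1, size=(4, 2)), rng.normal(5.0, 0.1, size=(4, 2))])
W = knn_affinity(V, k=2)

logits = rng.normal(size=(8, 2))       # stand-in for the network output
Y = gumbel_softmax(logits, temperature=1.5, rng=rng)
assignments = Y.argmax(axis=1)         # final hard assignment per node
```

Lower temperatures push the rows of `Y` toward one-hot vectors; the starting and minimum temperatures used per dataset are listed in Section 5.7.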
4.2 Embedding Step
In this section, we discuss the embedding step (line 3 in Algorithm 1). Different affinity measures, such as simple Euclidean distance or nearest neighbor pairs combined with a Gaussian kernel, have been used in spectral clustering. Recently, it has been shown that unsupervised application of a Siamese network to determine the distances improves the quality of the clustering [19].
In this work, we also use Siamese networks to learn embeddings that capture the affinities of the data points. A Siamese network is trained to learn an adaptive nearest neighbor metric. It learns the affinities directly from Euclidean proximity by "labeling" points $x_i, x_j$ positive if $\|x_i - x_j\|$ is small and negative otherwise. In other words, it generates embeddings such that adjacent nodes are closer in the embedding space and non-adjacent nodes are farther apart. Such a network is typically trained to minimize the contrastive loss:
$$L_{\text{siamese}}(x_i, x_j) = \begin{cases} \|v_i - v_j\|^2, & (x_i, x_j) \text{ is a positive pair} \\ \max(c - \|v_i - v_j\|, 0)^2, & (x_i, x_j) \text{ is a negative pair} \end{cases}$$
where $v_i = H_\phi(x_i)$, $c$ is a margin, and $H_\phi$ is a Siamese network that transforms representations in the input space $x_i$ to embeddings $v_i$.
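As a rough sketch of the contrastive loss used to train a Siamese network, the code below follows the standard formulation (squared distance for positive pairs, squared hinge on a margin for negative pairs). The margin value, the averaging over pairs, and the function name are assumptions; the embeddings are supplied directly rather than produced by a network.

```python
import numpy as np


def contrastive_loss(v_i, v_j, is_positive, margin=1.0):
    """Mean contrastive loss over a batch of embedding pairs.

    Positive pairs are penalized by their squared distance; negative pairs are
    penalized only while they are closer than `margin` (an assumed hyper-parameter).
    """
    dist = np.linalg.norm(v_i - v_j, axis=1)
    positive_term = dist ** 2
    negative_term = np.maximum(margin - dist, 0.0) ** 2
    return float(np.where(is_positive, positive_term, negative_term).mean())


# Three hypothetical pairs: one positive, one close negative, one far negative.
v_i = np.zeros((3, 2))
v_j = np.array([[0.5, 0.0], [0.5, 0.0], [3.0, 0.0]])
is_positive = np.array([True, False, False])

loss = contrastive_loss(v_i, v_j, is_positive)
```

The far negative pair contributes nothing because it already lies beyond the margin, which is what keeps the loss from pushing unrelated points apart indefinitely.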
The main goals of our experiments are to evaluate: (a) The performance of CNC against the existing clustering approaches. (b) The ability of CNC to generalize to unseen data compared to the top-performing generalizable baseline. (c) The effectiveness of minimizing Normalized cuts on improving the clustering results. (d) The generalization performance of CNC as we vary the number of data points in training dataset.
5.1 Datasets and Baseline Methods
We evaluate the performance of CNC in comparison to several deep learning-based clustering approaches on four real world datasets: MNIST, Reuters, CIFAR-10, and CIFAR-100. The details of the datasets are as follows:
MNIST is a collection of 70,000 28×28 gray-scale images of handwritten digits, divided into 60,000 training images and 10,000 test images.
The Reuters dataset is a collection of English news labeled by category. Like SpectralNet, DEC, and VaDE, we used the following categories: corporate/industrial, government/social, markets, and economics as labels and discarded all documents with multiple labels. Each article is represented by a tf-idf vector using the 2000 most frequent words. The dataset contains 685,071 documents. We divided the data randomly into a 90%-10% split to evaluate the generalization ability of CNC. We also investigate the impact of training data size on generalization by considering the following splits: 90%-10%, 70%-30%, 50%-50%, 20%-80%, and 10%-90%.
CIFAR-10 consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
CIFAR-100 has 100 classes containing 600 images each with a 500/100 train/test split per class.
In all runs we assume the number of clusters is given. The number of clusters (g) is 10 for MNIST and CIFAR-10, 4 for Reuters, and 100 for CIFAR-100. We compare CNC to SpectralNet [19], DEC [22], DCN [23], VaDE [12], DEPICT [8], IMSAT [10], and IIC [11]. While [23, 22] focused on learning representations for clustering, other approaches [12, 7, 10, 11, 2] rely on a variational autoencoder that utilizes a Gaussian mixture prior. SpectralNet [19] takes a deep learning approach to spectral clustering that generalizes to unseen data points. Table 1 shows the results reported for these methods.
Similar to prior work, for MNIST and Reuters we use publicly available, pre-trained autoencoders (https://github.com/slim1017/VaDE/tree/master/pretrain_weights). The autoencoder used to map the Reuters data to code space was trained on a random subset of 10,000 samples from the full dataset. Similar to prior work, for CIFAR-10 and CIFAR-100 we applied 50-layer pre-trained deep residual networks trained on ImageNet to extract features and used them for clustering.
5.2 Performance Measures
We use two commonly used measures, the unsupervised clustering accuracy (ACC) and the normalized mutual information (NMI) in [4], to evaluate the accuracy of the clustering. Both ACC and NMI are in [0, 1], with higher values indicating better correspondence between the clusters and the true labels. Note that the true labels are used neither in training nor in testing.
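Clustering Accuracy (ACC): For data points $X = \{x_1, \dots, x_n\}$, let $l = (l_1, \dots, l_n)$ and $c = (c_1, \dots, c_n)$ be the true labels and predicted clusters respectively. The ACC is defined as:
$$\mathrm{ACC}(l, c) = \frac{1}{n} \max_{\pi \in \Pi} \sum_{i=1}^{n} \mathbb{1}\{ l_i = \pi(c_i) \}$$
where $\Pi$ is the collection of all permutations of $\{1, \dots, g\}$. The optimal permutation $\pi$ can be computed using the Kuhn-Munkres algorithm [16].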
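The definition of ACC as a maximum over label permutations can be illustrated with a direct, brute-force search. This sketch is only practical for a handful of clusters, since it enumerates all g! permutations instead of running the Kuhn-Munkres algorithm; the function name and the toy labels are assumptions.

```python
from itertools import permutations


def clustering_accuracy(true_labels, pred_clusters, g):
    """Best fraction of matches over all relabelings of the g predicted clusters."""
    n = len(true_labels)
    best_hits = 0
    for perm in permutations(range(g)):   # perm[c] is the label assigned to cluster c
        hits = sum(1 for l, c in zip(true_labels, pred_clusters) if l == perm[c])
        best_hits = max(best_hits, hits)
    return best_hits / n


true_labels = [0, 0, 1, 1, 2, 2]
acc_renamed = clustering_accuracy(true_labels, [1, 1, 2, 2, 0, 0], g=3)    # same partition, renamed
acc_one_error = clustering_accuracy(true_labels, [1, 1, 2, 2, 0, 1], g=3)  # one point misplaced
```

For larger numbers of clusters such as the 100 classes of CIFAR-100, exhaustive enumeration is infeasible, and a polynomial-time assignment solver (for example `scipy.optimize.linear_sum_assignment` applied to the cluster-versus-label count matrix) is the usual substitute.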
Normalized Mutual Information (NMI): Let $I(l; c)$ be the mutual information between $l$ and $c$, and $H(\cdot)$ be their entropy. The NMI is:
$$\mathrm{NMI}(l, c) = \frac{I(l; c)}{\max\{H(l), H(c)\}}$$
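A small pure-Python sketch of NMI follows. It assumes normalization by the larger of the two entropies, uses natural logarithms (the ratio does not depend on the base), and returns 1.0 by convention when both labelings are constant; these choices and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same points."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def nmi(a, b):
    """Mutual information divided by max(H(a), H(b))."""
    denom = max(entropy(a), entropy(b))
    if denom == 0.0:
        return 1.0  # both labelings are constant (assumed convention)
    return mutual_information(a, b) / denom


nmi_same = nmi([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])  # identical partition, renamed
nmi_independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])       # statistically independent
```

Like ACC, the score is invariant to how the clusters are named, which is why no permutation search is needed here.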
5.3 Experimental Results
For each dataset, we trained a Siamese network [9, 18] to learn embeddings that represent the affinity of data points by considering only the k-nearest neighbors of each data point. In Table 1, we compare clustering performance across four benchmark datasets. Since most of the clustering approaches do not generalize to unseen data points, all data has been used for training (later, in Section 5.4, we use a 90%-10% split for training and testing to evaluate generalizability).
While the improvement of CNC is marginal on MNIST, it performs better across the other three datasets. Specifically, on CIFAR-10, CNC outperforms SpectralNet and IIC on ACC by 20.1% and 10.9%, respectively. Moreover, the NMI is improved by 12.3%. The results on Reuters and CIFAR-100 show 0.021% and 11% improvements on ACC, respectively. The NMI is also 27% better on CIFAR-100. The fact that CNC outperforms existing approaches on most datasets suggests the effectiveness of using our deep learning approach to optimize normalized cuts for clustering.
We further evaluate the generalization ability of CNC by dividing the data randomly into a 90%-10% split, training on the training set, and reporting the ACC and NMI on the test set (Table 2). Among the seven methods in Table 1, only SpectralNet is able to generalize to unseen data points. CNC outperforms SpectralNet on most datasets by up to 21.9% on ACC and up to 10.7% on NMI. Note that a simple argmax over the output of CNC retrieves the clustering assignments, while SpectralNet relies on k-means to predict the final clusters.
5.5 Impact of Normalized cuts in clustering
To evaluate the impact of normalized cuts on the clustering task, we calculate the numerical value of the normalized cuts (Equation 1) over the clustering results of CNC and SpectralNet. Since this calculation over the whole dataset is very expensive, we only show the result over the test set.
Table 3 shows the numerical value of the normalized cuts over the clustering results of CNC and SpectralNet. CNC is able to find better cuts than SpectralNet. Moreover, we observe that for the datasets where the improvement of CNC is marginal (MNIST and Reuters), the normalized cuts of CNC are also only slightly better than those of SpectralNet, while for CIFAR-10 and CIFAR-100, where the accuracy improved significantly, the normalized cuts of CNC are also much smaller than those of SpectralNet. The higher accuracy (ACC in Table 2) and smaller normalized cuts (Table 3) verify that the CNC loss function is indeed a good objective for the clustering task.
5.6 Impact of training data size on the generalization
As the generalization results show (Table 2), when we reduce the size of the training data to 90%, the accuracy of CNC changes only slightly compared to training over the whole data (Table 1). Based on this observation, we next investigate how varying the size of the training dataset affects generalization, i.e., how the ACC and NMI on the test data change when we vary the size of the training dataset.
We ran experiments on the Reuters dataset by dividing the data randomly based on the following data splits: 90%-10%, 70%-30%, 50%-50%, 20%-80%, and 10%-90%. For example, in 10%-90%, we train CNC on 10% of the data and report the ACC and NMI of CNC on the 90% test set. Figure 3 shows how the ACC and NMI of CNC on the test data change as the size of the training data is varied. For example, when the size of the training data is 90%, the ACC of CNC on the test data is 0.824.
As expected, and as shown in Figure 3, the ACC and NMI of CNC increase as the size of the training data increases. Interestingly, we observed that with only 10% training data the ACC of CNC is 0.68, which is only 14% lower than the ACC with 90% training data. Similarly, the NMI of CNC with 10% training data is only 18% lower than the NMI with 90% training data.
5.7 Model Architecture and Hyper-parameters:
Here are the details of the CNC model for each dataset.
MNIST: The Siamese network has 4 layers sized [1024, 1024, 512, 10] with ReLU. The clustering module has 2 layers sized [512, 512] with a final Gumbel softmax layer. Batch size is 256 and we only consider 3 nearest neighbors to find the embeddings and construct the affinity graph for each batch. We use Adam with lr = 0.005 with decay 0.5. Temperature starts at 1.5 and the minimum is set to 0.5.
Reuters: The Siamese network has 3 layers sized [512, 256, 128] with ReLU. The clustering module has 3 layers sized [512, 512, 512] with tanh activation and a final Gumbel softmax layer. Batch size is 128 and we only consider 3 nearest neighbors to find the embeddings and construct the affinity graph for each batch. We use Adam with lr = 1e-4 with decay 0.5. Temperature starts at 1.5 and the minimum is set to 1.0.
CIFAR-10: The Siamese network has 2 layers sized [512, 256] with ReLU. The clustering module has 2 layers sized [512, 512] with tanh activation and a final Gumbel softmax layer. Batch size is 256 and we only consider 2 nearest neighbors to find the embeddings and construct the affinity graph for each batch. We use Adam with lr = 1e-4 with decay 0.1. Temperature starts at 2.5 and the minimum is set to 0.5.
CIFAR-100: The Siamese network has 2 layers sized [512, 256] with ReLU. The clustering module has 3 layers sized [512, 512, 512] with tanh activation and a final Gumbel softmax layer. Batch size is 1024 and we only consider 3 nearest neighbors to find the embeddings and construct the affinity graph for each batch. We use Adam with lr = 1e-3 with decay 0.5. Temperature starts at 1.5 and the minimum is set to 1.0.
We propose CNC (Clustering by learning to optimize Normalized Cuts), a framework for learning to cluster unlabeled examples. We define a differentiable loss function equivalent to the expected normalized cuts and use it to train a CNC model that directly outputs final cluster assignments. CNC achieves state-of-the-art results on popular unsupervised clustering benchmarks (MNIST, Reuters, CIFAR-10, and CIFAR-100) and outperforms the strongest baselines by up to 10.9%. CNC also enables generalization, yielding up to 21.9% improvement over SpectralNet [19], the previous best-performing generalizable clustering approach.
References
[1] Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, pp. 2399–2434.
[2] (2018) Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images. CoRR abs/1808.10356.
[3] (2003) Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pp. 177–184.
[4] (2011) Locally consistent concept factorization for document clustering. IEEE Trans. on Knowl. and Data Eng. 23 (6), pp. 902–913.
[5] (2014) Balanced k-means and min-cut clustering. arXiv preprint arXiv:1411.6235.
[6] (2017) A self-balanced min-cut algorithm for image clustering. In ICCV, pp. 2080–2088.
[7] (2016) Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
[8] Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. The IEEE International Conference on Computer Vision (ICCV).
[9] Dimensionality reduction by learning an invariant mapping. Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR '06, pp. 1735–1742.
[10] (2017) Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Vol. 70, pp. 1558–1567.
[11] (2019) Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653.
[12] Variational deep embedding: an unsupervised and generative approach to clustering. Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 1965–1972.
[13] (2009) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6 (1), pp. 29–123.
[14] (2017) Balanced clustering with least square regression. In AAAI, pp. 2231–2237.
[15] (2016) Extreme learning machine for out-of-sample extension in Laplacian eigenmaps. Pattern Recognition 74 (C), pp. 68–73.
[16] (1957) Algorithms for the Assignment and Transportation Problems. Journal of the Society for Industrial and Applied Mathematics 5 (1), pp. 32–38.
[17] On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856.
[18] (2018) Learning by coincidence: Siamese networks and common variable learning. Pattern Recognition 74, pp. 52–63.
[19] (2018) SpectralNet: spectral clustering using deep neural networks. In International Conference on Learning Representations.
[20] (2000-08) Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22 (8), pp. 888–905.
[21] (2007) A tutorial on spectral clustering. Statistics and Computing 17 (4), pp. 395–416.
[22] (2016) Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pp. 478–487.
[23] (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3861–3870.
[24] (2018) Understanding regularized spectral clustering via graph conductance. In NeurIPS, pp. 10654–10663.