A Classification-Based Approach to Semi-Supervised Clustering with Pairwise Constraints

01/18/2020 ∙ by Marek Śmieja, et al. ∙ University of Lisbon ∙ Jagiellonian University

In this paper, we introduce a neural network framework for semi-supervised clustering (SSC) with pairwise (must-link or cannot-link) constraints. In contrast to existing approaches, we decompose SSC into two simpler classification tasks/stages: the first stage uses a pair of Siamese neural networks to label the unlabeled pairs of points as must-link or cannot-link; the second stage uses the fully pairwise-labeled dataset produced by the first stage in a supervised neural-network-based clustering method. The proposed approach, S3C2 (Semi-Supervised Siamese Classifiers for Clustering), is motivated by the observation that binary classification (such as assigning pairwise relations) is usually easier than multi-class clustering with partial supervision. On the other hand, being classification-based, our method solves only well-defined classification problems, rather than less well specified clustering tasks. Extensive experiments on various datasets demonstrate the high performance of the proposed method.




1 Introduction

Clustering is an important unsupervised learning tool often used to analyze the structure of complex high-dimensional data. Without any additional information about the underlying class/cluster structure, clustering results may contradict our prior knowledge or assumptions about the data being analyzed.

Semi-supervised clustering (SSC) methods tackle this issue by leveraging partial prior information about class labels, with the goal of obtaining partitions that are better aligned with true classes Basu et al. (2004, 2008); Liu and Fu (2015); Cheng et al. (2007); Law et al. (2005); Lu and Leen (2004). One typical way of injecting class label information into clustering is in the form of pairwise constraints (typically, must-link and cannot-link constraints), or pair-wise preferences (e.g., should-link and shouldn’t-link), which indicate whether a given pair of points is believed to belong to the same or different classes.

Most SSC approaches rely on adapting existing unsupervised clustering methods to handle partial (namely, pairwise) information Melnykov et al. (2016); Bilenko et al. (2004); Cheng et al. (2007); Law et al. (2005); Lu and Leen (2004); Qian et al. (2017). This requires transferring class-label knowledge into a clustering algorithm, which is often unnatural and puts a higher weight on clustering structure than on class labels. It has been recently shown that discriminative clustering methods, which approach clustering problems using classification tools, are usually more effective in taking advantage of label constraints/information Pei et al. (2016); Śmieja et al. (2018). Although those formulations treat class labels as the primary source of semi-supervision, they may struggle to produce satisfactory results in the presence of a large number of clusters: in that case, a small number of pairwise constraints may not suffice to determine the correct cluster assignments.

In this paper, we go one step further than other discriminative approaches and decouple SSC into two stages:

Stage 1: predict pairwise relations between pairs of unlabeled points (a binary classification problem), which allows assigning predicted labels to unlabeled pairs, thus increasing the number of labeled pairs;

Stage 2: use the labeled pairs (both given and predicted) in a semi-supervised clustering method.

The rationale behind our approach follows from the observation that it is easier to learn a binary classifier than to solve a multiclass problem under partial supervision, especially when the number of classes (clusters) is high. To increase the flexibility of our framework, we instantiate it with two neural networks, specifically with so-called Siamese neural networks Bromley et al. (2005); Koch et al. (2015). The first network (LabNet, the labeling network) is used to classify pairs of examples as must-link or cannot-link constraints, while the second network (CluNet, the clustering network) is trained on labeled pairs to predict the final cluster assignments (see Figure 1). We term our method S3C2 (Semi-Supervised Siamese Classifiers for Clustering).

Figure 1: Illustration of the proposed S3C2 model. (a) LabNet, the labeling network: two inputs are passed through twin neural networks trained with the contrastive loss, eq. (1) or (2), to produce a link prediction. (b) CluNet, the clustering network: two inputs are passed through twin neural networks trained with the clustering loss, eq. (3), to produce cluster assignments. The labeling network is trained to label new pairs as must-link or cannot-link constraints. The clustering network is trained on the set of pairwise constraints generated by the labeling network to predict the final cluster assignments.

In the experiments reported below, we implement S3C2 with general-purpose dense deep neural networks (DNNs) as well as with convolutional neural networks (CNNs) to handle images. In both cases, S3C2 outperforms other neural-network-based SSC techniques. Additionally, we experimentally and theoretically analyze the impact of the networks’ parameterization on the clustering results. Our contributions are summarized as follows:

  1. a classification-based method for SSC with pairwise constraints, which first labels pairs of data points and then uses these predicted labels to perform SSC;

  2. an implementation of the proposed method with two Siamese DNNs, which allows controlling the flexibility of the clustering model by adjusting the numbers of layers and neurons; the corresponding parameterization is studied theoretically and experimentally;

  3. experimental results showing the superiority of the proposed S3C2 method over related approaches on several datasets, including Letters, which has 26 classes; to the best of our knowledge, SSC had not been tested on such a large number of classes.

The code of our method will be made publicly available online after acceptance of the paper.

2 Related work

The most common way of using pairwise constraints in SSC relies on modifying the underlying cost function of classical unsupervised clustering models Śmieja and Wiercioch (2017); Qian et al. (2017); Lu et al. (2016). Such an approach was used in k-means, using a term penalizing pairwise constraint violation Bilenko et al. (2004), and in Gaussian mixture models (GMM), with hidden Markov random fields modelling pairwise relations Basu et al. (2004); Cheng et al. (2007); Lu and Leen (2004). In spectral clustering, the underlying eigenvalue problem was modified by adding the pairwise constraints to the corresponding objective function Kawale and Boley (2013); Wang and Davidson (2010). Another line of work focuses on modifying the similarity measure based on the pairwise relations Asafi and Cohen-Or (2013); Chang et al. (2014); Wang et al. (2012), by learning optimal Mahalanobis distances Davis et al. (2007); Xing et al. (2003), or more general kernel functions Yin et al. (2010).

Recently, it has been shown that discriminative clustering formulations Kaski et al. (2005) are often more effective in leveraging pairwise relations than the aforementioned methods. The authors of Pei et al. (2016) used an analogue of the classification log-loss function based on pairwise constraints and added entropy regularization Krause et al. (2010) to prevent degenerate solutions. In a similar spirit, Śmieja et al. (2018) maximized the expected number of correctly classified pairs based on pairwise constraints and an underlying distance function. The authors of Calandriello et al. (2014) used a squared-loss mutual information to regularize a discriminative clustering model.

Although DNNs are dominant in many areas of machine learning, they have rarely been used for SSC. The authors of Hsu and Kira (2015) used a KL-divergence-based loss to train a DNN to predict cluster distribution from pairwise relations; one limitation of that method is its inability to use unlabeled data. Other works Fogel et al. (2019); Shukla et al. (2018); Zhang et al. (2019) used auto-encoders with reconstruction losses to exploit inner characteristics of unlabeled data. In Shukla et al. (2018), the k-means loss is combined with KL-divergence to create compact clusters preserving pairwise relations. In Fogel et al. (2019), the distance between must/cannot-link pairs was minimized/maximized, instead of using KL-divergence. Deep embedding clustering (DEC) is a method that jointly learns feature representations and cluster assignments using deep neural networks Xie et al. (2016). Finally, a method capable of using various types of side information has been proposed in Zhang et al. (2019).

Our work extends recent discriminative SSC methods Śmieja et al. (2018); Pei et al. (2016) by learning additional pairwise relations. Moreover, the approach is implemented using Siamese neural networks Bromley et al. (2005); Hadsell et al. (2006); Koch et al. (2015), allowing for higher flexibility. In contrast to the aforementioned deep SSC methods, our model is fully discriminative and uses misclassification error as the only loss term.

3 Proposed Method

3.1 Formulation

Let $X$ be a dataset, where every instance belongs to one of $K$ classes. The goal is to split $X$ into $K$ clusters, which are compatible with the true (unknown) classes.

We assume that partial class information is given in the form of pairwise constraints, indicating whether two examples belong to the same (must-link constraint) or different (cannot-link constraint) classes. Formally, the class information is expressed via a set $\mathcal{L} \subset X \times X$ of labeled pairs, where $\mathcal{L} = \mathcal{M} \cup \mathcal{C}$, and $\mathcal{M}$ and $\mathcal{C}$ contain the must-link and the cannot-link pairs, respectively.

To make the notation lighter in what follows, we assume that $\mathcal{M}$ always contains all the pairs of the form $(x, x)$, because the binary relation “belong to the same class” is obviously reflexive. Furthermore, because the binary relations “belong to the same class” and “belong to different classes” are both symmetric, we also assume that $(x, y) \in \mathcal{M}$ implies $(y, x) \in \mathcal{M}$, and likewise for $\mathcal{C}$.

Finally, let $\mathcal{U} = (X \times X) \setminus \mathcal{L}$ denote the set of unlabeled pairs.

The proposed S3C2 model is composed of two classification neural networks. The labeling network (LabNet) is trained to assign labels (must-link or cannot-link) to new pairs of examples, i.e., pairs not in the labeled set $\mathcal{L}$. The clustering network (CluNet) is trained to use labeled pairs to predict cluster assignments. The proposed scheme is illustrated in Figure 1 and described in detail next.

3.2 The Labeling Network: LabNet

Instead of doing SSC directly, which can be a difficult multi-class problem, we first tackle a simpler binary classification problem: learning to label new pairs (i.e., pairs not in the labeled set $\mathcal{L}$) as belonging to the same class (must-link) or different classes (cannot-link). By classifying the pairs that are in $X \times X$ but not in $\mathcal{L}$, we obtain new must-link/cannot-link labels that will be used by CluNet to predict the final cluster assignments (as described in the next subsection).

We address this classification problem using a pair of Siamese neural networks (identical networks, i.e., with shared weights) Koch et al. (2015). The task of these networks is to take a pair of points $(x, y)$ and return their representations $(z_x, z_y)$, based on which it will be decided if $x$ and $y$ are in the same or different classes. Naturally, they are trained to make $z_x$ close to $z_y$ if $(x, y)$ is a must-link pair, and distant from $z_y$ if $(x, y)$ is a cannot-link pair. To this end, we use a contrastive loss based on the Euclidean distance $d(x, y) = \|z_x - z_y\|_2$, defined as:


Notice that the presence of pairs of the form $(x, x)$ among the must-link pairs does not contribute to the loss because $d(x, x) = 0$.

Clearly, being a distance, $d(x, y) \geq 0$, for all $x, y$. Observe that a cannot-link pair contributes to the loss only if its distance is below 1, see Figure 2. A crucial aspect is that LabNet does not decide whether two points belong to the same or different classes; it only yields similarity scores for pairs of data points. (Footnote 1: Siamese networks have been used for one-shot learning Koch et al. (2015), where the class of a given example is decided by comparing the output of one of the twin networks with that of the other on a set of examples of known classes.) A hard link prediction is obtained by comparing the distance with a threshold $\tau$: $x$ and $y$ are classified as being in the same class if and only if $d(x, y) < \tau$. One natural choice is , because if , then as well. Below, we will explain that a smaller threshold is usually a better choice in our case.
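The behaviour described above (must-link pairs pulled together, cannot-link pairs penalized only when closer than 1, and a hard decision obtained by thresholding the distance) can be sketched as follows. This is a minimal illustration in the style of the contrastive loss of Hadsell et al. (2006), not necessarily the exact form of eq. (1): the squared terms, the averaging over pairs, the strict comparison with the threshold, and the names `euclidean`, `contrastive_loss`, `predict_link`, and `tau` are all assumptions made here.

```python
import math


def euclidean(z_x, z_y):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_x, z_y)))


def contrastive_loss(pairs, margin=1.0):
    """Average contrastive loss over labeled pairs (assumed form).

    pairs: iterable of (z_x, z_y, is_must_link) tuples.
    Must-link pairs are penalized by their squared distance;
    cannot-link pairs are penalized only when closer than `margin`.
    """
    total, count = 0.0, 0
    for z_x, z_y, is_must_link in pairs:
        d = euclidean(z_x, z_y)
        if is_must_link:
            total += d ** 2
        else:
            total += max(0.0, margin - d) ** 2
        count += 1
    return total / count if count else 0.0


def predict_link(z_x, z_y, tau):
    """Hard prediction: must-link iff the embedding distance is below tau."""
    return euclidean(z_x, z_y) < tau
```

With `margin=1.0`, a cannot-link pair at distance 5 contributes nothing, while one at distance 0.5 contributes 0.25; a pair of the form (x, x) contributes nothing as a must-link, in line with the remarks above.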

Figure 2: Contrastive loss of labeling network for must-link and cannot-link pairs.

In the training phase of the LabNet, only pairwise constraints are used (the loss in (1) only depends on the labeled pairs). To leverage information contained in unlabeled data, we consider an adaptation of SEVEN (SEmi-supervised VErification Network, Noroozi et al. (2017)), yielding a semi-supervised version of LabNet. The idea is to encourage the embedding mapping to learn a salient structure shared by all categories. For this purpose, each Siamese twin is supplied with a decoder network $g$, which aims at obtaining a reconstruction $\hat{x}$ of $x$ from its latent representation $z_x$: $\hat{x} = g(z_x)$. This goal is pursued by using a reconstruction error loss term,

Finally, the total loss used for training the semi-supervised LabNet is


where $\lambda$ is a trade-off parameter.
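One way to write the two terms just described, assuming a squared-error reconstruction loss averaged over the dataset and an additive combination with the contrastive loss. The symbols $\mathcal{L}_{\mathrm{con}}$, $\mathcal{L}_{\mathrm{rec}}$, $g$, $z_x$, and $\lambda$ are notation chosen for this sketch, and the exact normalization used in eq. (2) may differ:

```latex
\mathcal{L}_{\mathrm{rec}} = \frac{1}{|X|} \sum_{x \in X} \left\| x - g(z_x) \right\|_2^2 ,
\qquad
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{con}} + \lambda \, \mathcal{L}_{\mathrm{rec}} .
```

Under this form, setting $\lambda = 0$ recovers the purely supervised LabNet, which uses the contrastive term only.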

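Once trained, the LabNet is applied to yield pairwise constraints for all pairs of data points. Let $\hat{\mathcal{M}}$ and $\hat{\mathcal{C}}$ denote the resulting sets of must-link and cannot-link pairs, respectively, each containing both the given and the predicted constraints; clearly, $\hat{\mathcal{M}} \cup \hat{\mathcal{C}} = X \times X$.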

3.3 The Clustering Network: CluNet

Since applying the LabNet to the unlabeled pairs yields pairwise constraints for all the pairs in the dataset, the final clustering can be obtained in a purely supervised manner. Instead of a typical unsupervised clustering method (e.g., k-means or GMM), we thus employ a discriminative framework, which is more effective in the supervised case. Namely, we directly model cluster assignments with posterior probabilities $p(k \mid x)$, for $k = 1, \dots, K$. From these posterior probability estimates, $X$ may be partitioned by assigning every point $x$ to the cluster $k$ that maximizes $p(k \mid x)$.

To provide sufficient flexibility, we instantiate the CluNet as a Siamese pair of identical DNNs, where each pair of points is processed by two identical (Siamese twins) sub-networks with shared weights. Equipped with softmax output layers, these Siamese twin networks yield class posterior probabilities $p(\cdot \mid x)$ and $p(\cdot \mid y)$, for each pair of items $(x, y)$.

To form clusters consistent with pairwise constraints, we aim at minimizing the number of misclassified pairs. Note that, given the posterior class probabilities $p(\cdot \mid x)$ and $p(\cdot \mid y)$, for a pair of points $(x, y)$, the probability that $x$ and $y$ are in the same cluster is given by $p_{xy} = \sum_{k=1}^{K} p(k \mid x) \, p(k \mid y)$, whereas $1 - p_{xy}$ is the probability that they are in different clusters. We thus define the misclassification loss with respect to the must-link and cannot-link information as


The structure of the CluNet is shown in Figure 1(b). Whereas during the training phase, the loss function in (3) uses the Siamese pair since it applies to pairs of points, in the testing phase, only one of the networks is needed (as indicated in Figure 1(b)) to produce cluster assignments, where a given point $x$ is assigned to the cluster with the highest posterior probability: $\arg\max_k p(k \mid x)$.
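The quantities used by CluNet can be sketched in a few lines. The following is an illustration under the assumption that the misclassification loss is the unnormalized expected number of misclassified pairs (eq. (3) may include a normalization constant); the function names are hypothetical and the posteriors are plain Python lists standing in for softmax outputs.

```python
def same_cluster_prob(p_x, p_y):
    """Probability that two points fall in the same cluster: sum_k p(k|x) p(k|y)."""
    return sum(a * b for a, b in zip(p_x, p_y))


def misclassification_loss(must_link, cannot_link):
    """Expected number of misclassified pairs (assumed, unnormalized form).

    must_link / cannot_link: lists of (p_x, p_y) posterior pairs.
    A must-link pair is misclassified with probability 1 - p_xy,
    a cannot-link pair with probability p_xy.
    """
    loss = sum(1.0 - same_cluster_prob(p_x, p_y) for p_x, p_y in must_link)
    loss += sum(same_cluster_prob(p_x, p_y) for p_x, p_y in cannot_link)
    return loss


def assign_cluster(p_x):
    """Test-time assignment: index of the highest posterior probability."""
    return max(range(len(p_x)), key=lambda k: p_x[k])
```

Only `same_cluster_prob` and `misclassification_loss` involve pairs; `assign_cluster` needs a single posterior vector, which mirrors the remark that one twin network suffices at test time.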

3.4 Adjusting the LabNet Classification Threshold

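We analyze the influence of the LabNet threshold $\tau$ on the CluNet results. Let us begin by assuming that $\tau = 0$; in this case, all pairs in the unlabeled set $\mathcal{U}$ are labeled by the LabNet as cannot-links (i.e., the LabNet output is $\hat{\mathcal{M}} = \mathcal{M}$ for must-links and $\hat{\mathcal{C}} = \mathcal{C} \cup \mathcal{U}$ for cannot-links) and the loss in (3) can be written as

Assuming that labeled pairs are far fewer than unlabeled ones, $|\mathcal{L}| \ll |\mathcal{U}|$ (as will be the case in all the experiments below and is the typical scenario in SSC),

where $p_k = \frac{1}{n} \sum_{x \in X} p(k \mid x)$, with $n = |X|$, is an estimate of the probability of the $k$-th class approximated by the clustering model (notice that $\sum_{k} p_k = 1$). The last term in (4) is related to the index-2 Tsallis entropy Furuichi (2006), $T_2(p) = 1 - \sum_{k} p_k^2$. Since $T_2(p)$ is maximized by the uniform distribution, $\sum_{k} p_k^2$ is (approximately) minimized by taking equally-sized clusters. This means that predicting a large number of cannot-link pairs (by setting $\tau = 0$) encourages high-entropy (approximately uniform) clusterings and discourages degenerate solutions.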
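The link between cannot-link pairs and the index-2 Tsallis entropy can be checked numerically. The sketch below assumes that the class proportions are estimated by averaging the posteriors over the dataset; the helper names and the two toy posterior sets are hypothetical. It verifies that the average same-cluster probability over all pairs equals the sum of squared proportions, i.e., one minus the Tsallis entropy.

```python
def class_proportions(posteriors):
    """Average the per-point posteriors to estimate cluster proportions p_k."""
    n = len(posteriors)
    k = len(posteriors[0])
    return [sum(p[j] for p in posteriors) / n for j in range(k)]


def tsallis2(p):
    """Index-2 Tsallis entropy: 1 - sum_k p_k^2."""
    return 1.0 - sum(v * v for v in p)


def mean_pairwise_same_cluster(posteriors):
    """Average of sum_k p(k|x) p(k|y) over all ordered pairs (x, y)."""
    n = len(posteriors)
    total = sum(
        sum(a * b for a, b in zip(p_x, p_y))
        for p_x in posteriors
        for p_y in posteriors
    )
    return total / (n * n)


# Toy posteriors: two equally-sized clusters vs. a single degenerate cluster.
balanced = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
degenerate = [[1.0, 0.0]] * 4
```

Treating every pair as cannot-link penalizes the degenerate clustering (average same-cluster probability 1) more than the balanced one (average 0.5), which is the entropic regularization effect discussed above.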

By increasing the threshold $\tau$, more pairs are classified as must-link and fewer as cannot-link. In this case,

where (above a certain value) . This can be rewritten as

where the last term is a constant that depends only on the output of the LabNet. This form of the loss function shows that: (a) it encourages pairs in the predicted must-link set $\hat{\mathcal{M}}$ to be given high probability of being classified in the same class (large same-cluster probability $p_{xy}$); (b) it encourages the Tsallis entropy of the estimated class probabilities to be high (low $\sum_{k} p_k^2$). In other words, must-link constraints (those in $\hat{\mathcal{M}}$) play a more active role in this loss function, whereas cannot-link pairs (those in the predicted cannot-link set $\hat{\mathcal{C}}$) essentially only contribute to the entropic term of the loss.
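The three-term form referred to above can be reproduced under the assumption that the loss in (3) is the unnormalized expected number of misclassified pairs. Here $p_{xy} = \sum_k p(k \mid x)\, p(k \mid y)$ is the same-cluster probability, $\hat{\mathcal{M}}$ and $\hat{\mathcal{C}}$ are the must-link and cannot-link sets produced by the LabNet (together covering $X \times X$), $n = |X|$, and $p_k = \frac{1}{n} \sum_{x} p(k \mid x)$. The notation and the absence of a normalization constant are assumptions of this sketch:

```latex
\begin{aligned}
\mathcal{L}
&= \sum_{(x,y) \in \hat{\mathcal{M}}} (1 - p_{xy}) \;+\; \sum_{(x,y) \in \hat{\mathcal{C}}} p_{xy} \\
&= \sum_{(x,y) \in X \times X} p_{xy} \;-\; 2 \sum_{(x,y) \in \hat{\mathcal{M}}} p_{xy} \;+\; |\hat{\mathcal{M}}| \\
&= n^2 \sum_{k=1}^{K} p_k^2 \;-\; 2 \sum_{(x,y) \in \hat{\mathcal{M}}} p_{xy} \;+\; |\hat{\mathcal{M}}| .
\end{aligned}
```

The second line uses $\sum_{\hat{\mathcal{C}}} p_{xy} = \sum_{X \times X} p_{xy} - \sum_{\hat{\mathcal{M}}} p_{xy}$, and the third uses $\sum_{x,y} p_{xy} = \sum_k \big( \sum_x p(k \mid x) \big)^2$. The first term equals $n^2 (1 - T_2(p))$ and acts as the entropic regularizer, the middle term rewards high same-cluster probability on must-link pairs, and the last term does not depend on the CluNet parameters.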

The observation in the previous paragraph shows that obtaining must-link constraints is crucial for the performance of the CluNet. This is however a double-edged sword; correct must-links provide valuable information to train the CluNet, but erroneous ones may be very harmful. If two instances from different classes are wrongly put in the predicted must-link set $\hat{\mathcal{M}}$, this directly impacts the middle term of the loss, whereas two examples from the same class that are wrongly put in the predicted cannot-link set $\hat{\mathcal{C}}$ essentially only affect the regularization term (first term of the loss), in addition to being missing from $\hat{\mathcal{M}}$. Furthermore, an erroneous must-link constraint can be implicitly propagated to other pairs due to the transitivity of the binary relation “belong to the same class”, whereas the binary relation “belong to different classes” is not transitive.
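The transitivity argument can be made concrete with a small union-find sketch. This is only an illustration of how a single wrong must-link merges two otherwise separate groups; it is not part of the proposed method, in which the propagation happens implicitly through training. The function name and the toy constraints are hypothetical.

```python
def must_link_components(n, must_links):
    """Group n points into the components implied by must-link transitivity."""
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Points 0-2 belong to one class and points 3-5 to another.
correct_links = [(0, 1), (1, 2), (3, 4), (4, 5)]
with_one_error = correct_links + [(2, 3)]  # a single wrong must-link
```

Here `must_link_components(6, correct_links)` keeps the two classes apart, while adding the single erroneous pair `(2, 3)` collapses all six points into one component.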

The above considerations suggest that it is safer to use small values of the threshold $\tau$. This is especially important if the number of given pairwise constraints is small, because the accuracy of LabNet may then be low. In this case, the LabNet with a small $\tau$ puts in the predicted must-link set $\hat{\mathcal{M}}$ only pairs about which it is very confident. The other pairs will contribute to the entropic regularization term. If the number of given constraints is larger, we can use a higher threshold and label more pairs as must-link with higher confidence. Consequently, a high threshold may be optimal only in the presence of large sets of constraints, which is seldom the case in practice. Experimental validation of this rationale is presented in Section 4.3.

Figure 3: Performance of clustering models on four data types with varied numbers of constraints.
Figure 4: Performance of S3C2 with different thresholds $\tau$.

4 Experiments

In this section, we evaluate our approach S3C2 against state-of-the-art methods and investigate the effect of the parametrization of the LabNet on the clustering results.

Figure 5: Correlation between clustering NMI and three indicators of LabNet: accuracy, ML rate, and CL rate.

4.1 Experimental setting

We consider four popular datasets with normalized attributes:

  • MNIST: It contains 70k gray scale images of handwritten digits of size $28 \times 28$ (60k for training and 10k for testing) LeCun et al. (1998). The set is divided into 10 equally-sized classes.

  • Fashion-MNIST: It is a dataset of 70k gray scale images of size $28 \times 28$ (60k for training and 10k for testing) with 10 classes Xiao et al. (2017). Images show clothing items.

  • Reuters: This dataset contains English news stories labeled with a category tree Lewis et al. (2004). Analogously to previous uses of this data in clustering, we randomly sampled a subset of 12k examples (10k for training and 2k for testing) from 4 root classes: corporate/industrial, government/social, markets and economics. Documents were represented using TF-IDF features on the 2000 most frequent words.

  • Letters: This dataset contains a description of capital letters in the English alphabet (26 classes) Frey and Slate (1991). The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce 20k examples. Each example was converted into 16 primitive numerical attributes (statistical moments and edge counts). We used the first 15k examples for training and the remaining 5k for testing.

To generate pairwise constraints, we randomly select pairs of instances and label them either as must-link or cannot-link constraints (depending on their true relations). The numbers of must-links and cannot-links are kept equal. The results are evaluated using normalized mutual information (NMI, Strehl and Ghosh (2002)), which attains a maximal value of 1 for two identical partitions. To reduce the effect of randomness, we generate 5 different sets of pairwise constraints for each number of constraints; the final score is the NMI average over these 5 sets.
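The protocol above can be sketched with the Python standard library alone. The sampler below draws random pairs until it has equally many must-links and cannot-links, and the NMI function uses the geometric-mean normalization $I(U;V) / \sqrt{H(U) H(V)}$. Both are illustrative: the function names, the `seed` argument, the rejection-sampling strategy, and the handling of single-cluster partitions are assumptions, and duplicate pairs are not filtered out.

```python
import math
import random
from collections import Counter


def sample_constraints(labels, n_constraints, seed=0):
    """Draw random pairs, half must-link and half cannot-link.

    Assumes the labels contain at least two classes and at least one
    class with two or more points; otherwise the loop cannot terminate.
    """
    rng = random.Random(seed)
    half = n_constraints // 2
    must, cannot = [], []
    while len(must) < half or len(cannot) < half:
        i, j = rng.sample(range(len(labels)), 2)
        if labels[i] == labels[j]:
            if len(must) < half:
                must.append((i, j))
        elif len(cannot) < half:
            cannot.append((i, j))
    return must, cannot


def nmi(u, v):
    """Normalized mutual information between two partitions of the same points."""
    n = len(u)
    count_u, count_v, count_uv = Counter(u), Counter(v), Counter(zip(u, v))
    h_u = -sum(c / n * math.log(c / n) for c in count_u.values())
    h_v = -sum(c / n * math.log(c / n) for c in count_v.values())
    mi = sum(
        c / n * math.log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in count_uv.items()
    )
    if h_u == 0.0 or h_v == 0.0:
        # Degenerate case: at least one partition has a single cluster.
        return 1.0 if h_u == h_v else 0.0
    return mi / math.sqrt(h_u * h_v)
```

Averaging `nmi` over 5 calls to `sample_constraints` with different seeds would mirror the averaging over 5 constraint sets described above.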

4.2 Comparison with related models

We first compare the performance of S3C2 with other SSC approaches for various levels of pairwise constraints. We restrict our attention to DNN-based methods.

As explained in Section 3.3, each of the networks in the Siamese pair in CluNet is equipped with a softmax output layer. To make our model domain-agnostic, rather than specialized to a specific dataset or domain, we use two dense hidden layers with 256 neurons each and ReLU activation function, as well as dropout after each hidden layer (with rate 0.1, except for the Reuters dataset where we use a dropout rate of 0.5). Each batch consists of 100 training pairwise constraints and 1000 unlabeled pairs labeled by the LabNet. The learning rate is set to . The LabNet has an analogous structure: each DNN in Figure 1(a), which corresponds to the mapping introduced in Section 3.2, has 2 hidden dense layers with 256 neurons each, ReLU activation function, and dropout with the same rates, and an output dense layer also with 256 neurons, but with sigmoid activation function (i.e., $z_x \in [0, 1]^{256}$). The threshold in is set to . A more detailed study of the selection of $\tau$ is presented below. We use a batch size of 256 examples and learning rate of . We restrict our attention to the fully supervised version of LabNet.

For comparison we select three recent SSC methods:

  • d-graph: this is a DNN-based implementation of d-graph Śmieja et al. (2018). The network architecture is identical to CluNet (the batch structure is also the same). The closest unlabeled pairs in each batch are labeled as auxiliary must-link constraints, while the remaining pairs are considered as cannot-link. (Footnote 2: We also tried different numbers of neighbors, but the results were worse.)

  • DCPR: this is a DNN-based implementation of DCPR Pei et al. (2016) (the architecture and the batch structure are the same as in d-graph). The entropy and conditional entropy used to regularize the clustering model are estimated from each batch.

  • IDEC: this is an SSC method proposed in Zhang et al. (2019) (http://github.com/blueocean92) using pairwise constraints. The network structure and training procedure follow the authors’ code.

The results presented in Figure 3 show the good performance of S3C2, especially with the larger numbers of constraints. For the smaller numbers of constraints (100 and 200 links), LabNet is not able to accurately predict links, negatively influencing the performance of CluNet. In this case, S3C2 is inferior to d-graph, but it is still competitive with or better than DCPR and IDEC. It is worth emphasizing the extremely good results on the Letters dataset, which is composed of 26 classes. To the best of our knowledge, a dataset with so many classes had not been used before for SSC with pairwise constraints. (Footnote 3: A subset of the Letters dataset with only 5 classes was used in Śmieja et al. (2018); Pei et al. (2016).)

The results in Figure 3 show that d-graph performs best with the smaller numbers of pairwise constraints. For higher numbers of constraints, it is outperformed by S3C2 and IDEC. This is arguably due to the fact that d-graph generates auxiliary labeled pairs based only on distances. Moreover, a single batch may be too small to find a good k-NN graph. DCPR is competitive with S3C2 only on the Reuters dataset, but for other datasets its performance is worse. IDEC gives good results for large numbers of constraints, but its performance is not stable (its results do not always increase as the number of constraints grows for MNIST and Letters).

Figure 6: Performance of clustering methods with convolutional architecture applied to image data.
Figure 7: Difference between learning statistics of S3C2 using semi-supervised and supervised LabNet (clustering NMI as well as accuracy, ML rate, and CL rate of labeling networks).

4.3 Study of the Labeling Network

As discussed in Section 3.4, the choice of the labeling threshold $\tau$ may be crucial for the performance of S3C2. Since it may be difficult to find an optimal value using cross-validation if only a small number of labeled pairs is available, we experimentally analyze various threshold values to get better insight into our model.

The results presented in Figure 4 are consistent with the reasoning presented in Section 3.4. For small numbers of given constraints, LabNet is unable to correctly predict pairwise relations. It is thus better to use a low threshold and assign must-link constraints only to the most confident pairs, because erroneous must-link constraints negatively affect the clustering results and, as argued in Section 3.4, cannot-link constraints have essentially a regularization effect. For larger numbers of labeled pairs, a higher threshold can be used due to the better accuracy of LabNet. Although it is difficult to define a general rule for threshold selection, it can be seen that is a safe choice leading to good results for all datasets at all levels of semi-supervision.

To get further insight into our model, we compute the correlation coefficients between the clustering NMI and the classification statistics gathered from the LabNet. Namely, we consider: (a) accuracy; (b) must-link (ML) rate; (c) cannot-link (CL) rate. These quantities are defined as

While accuracy measures the overall performance of the LabNet classifier, the ML and CL rates assess how the model predicts examples from the underlying classes. Figure 5 shows that for small and medium numbers of constraints, the CL rate has the highest correlation with the clustering performance as measured by NMI. It is also interesting to observe that, in most cases, the ML rate has negative correlation with NMI, which partially confirms our intuition that labeling cannot-link pairs as must-link has a negative effect on the final performance. On the other hand, assigning cannot-link labels to must-link pairs does not have a negative influence, because it simply leads to stronger regularization. Such a labeling does not improve the performance of the clustering model, but it also does not deteriorate it. For the highest numbers of constraints (2k and 5k), the correlation with the CL rate is not so strong (it is negative for Fashion-MNIST). We verified that, in those cases, CL rates were higher than 95% for most models. Consequently, the clustering results could only be improved by increasing the ML rate. It is evident that the accuracy of LabNet cannot be used as the only indicator of final success. Clearly, higher accuracy allows obtaining better clustering results, but the ML and CL rates give us more detailed information. In particular, it is important to use a labeling network which has a high CL rate, and only then should one care about the ML rate.
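The three LabNet statistics discussed above can be computed as in the following sketch. It assumes that the ML rate is the fraction of true must-link pairs predicted as must-link and that the CL rate is the fraction of true cannot-link pairs predicted as cannot-link (i.e., per-class recall); the original definitions may differ in detail, and the function name is hypothetical.

```python
def labnet_statistics(true_ml, pred_ml):
    """Accuracy, ML rate, and CL rate of predicted links.

    true_ml / pred_ml: parallel lists of booleans (True = must-link).
    Returns NaN for a rate whose class has no pairs.
    """
    true_pos = sum(1 for t, p in zip(true_ml, pred_ml) if t and p)
    true_neg = sum(1 for t, p in zip(true_ml, pred_ml) if not t and not p)
    n_ml = sum(1 for t in true_ml if t)
    n_cl = len(true_ml) - n_ml
    accuracy = (true_pos + true_neg) / len(true_ml)
    ml_rate = true_pos / n_ml if n_ml else float("nan")
    cl_rate = true_neg / n_cl if n_cl else float("nan")
    return accuracy, ml_rate, cl_rate
```

For example, a labeling that misses one of two must-links but makes no false must-link predictions has accuracy 0.75, ML rate 0.5, and CL rate 1.0, which is the conservative regime that the analysis above favours.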

4.4 Model specialized to image processing

In the previous experiments, we used dense neural networks, which can be applied to generic (not too high-dimensional) datasets regardless of their domain. We now show that the performance of our method can be further increased by selecting a network architecture specialized to a given task. In particular, we present its specialization to image data, using the MNIST and Fashion-MNIST datasets. In addition, we also consider the semi-supervised version of LabNet Noroozi et al. (2017), which is trained on unlabeled pairs as well.

The CluNet is instantiated using two convolutional layers (32 filters each) with max pooling and dropout after each one. This is followed by two dense layers (with 128 and 10 neurons, respectively) and dropout between them. The architecture of the LabNet is composed of identical convolutional layers with max pooling and dropout, followed by a single dense layer with 128 neurons. In the case of semi-supervised LabNet, every Siamese twin is supplied with a decoder network, which is implemented using symmetric deconvolution layers and upsampling. Based on the results presented in Noroozi et al. (2017), we use as a trade-off parameter in (2). The other models, d-graph, DCPR and IDEC, are implemented using analogous architectures. Additionally, we use NNclustering Hsu and Kira (2015). In contrast to the other methods herein considered, NNclustering is trained only on the set of pairwise constraints (no unlabeled pairs are used); we use the authors’ code, where the method is implemented using convolutional LeNet networks.

The results presented in Figure 6 demonstrate that a specialized convolutional architecture yields better clustering results than dense networks (see Figure 3 for a comparison). Moreover, the use of unlabeled data in LabNet has a positive influence on the final results. Our clustering method with semi-supervised LabNet noticeably outperforms its variant with supervised LabNet when a small number of constraints is available. For larger numbers of constraints, the difference is smaller, because the network has enough data to be trained. As before, both variants of our method are better than d-graph, DCPR, and NNclustering. IDEC is also inferior to our method, except in the case of 5000 constraints for MNIST, where it obtains the highest performance.

In addition to clustering NMI, we also assessed accuracy, as well as the ML and CL rates for both versions of LabNet. The differences between these quantities for the semi-supervised and supervised LabNet are shown in Figure 7. The results demonstrate that the semi-supervised LabNet yields higher CL rates when fewer constraints are given. This again confirms that our clustering method is very sensitive to erroneous ML constraints and that a good labeling network should correctly predict most cannot-link pairs.

5 Conclusion

In this paper, we introduced a classification-based approach to semi-supervised clustering with pairwise constraints. It was shown that decomposing a semi-supervised clustering task into two simpler problems, classifying pairwise relations and then performing supervised clustering, is a better option than directly solving the original task. Our framework is implemented using two Siamese neural networks and is experimentally shown to achieve state-of-the-art performance on several benchmark datasets.

In the future, we plan to investigate different approaches for classifying pairwise relations. On the one hand, it is beneficial to construct a model that guarantees high CL rate. On the other hand, one could also design active learning mechanisms, which query must-link pairs with high probability, in order to strengthen the clustering model.


This work was carried out when M. Śmieja was a Post-Doctoral Scholar at Instituto Superior Técnico, University of Lisbon. The work was partially supported by the National Science Centre (Poland), grant no. 2016/21/D/ST6/00980.


  • S. Asafi and D. Cohen-Or (2013) Constraints as features. In

    IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)

    pp. 1634–1641. Cited by: §2.
  • S. Basu, M. Bilenko, and R. Mooney (2004) A probabilistic framework for semi-supervised clustering. In ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 59–68. Cited by: §1, §2.
  • S. Basu, I. Davidson, and K. Wagstaff (2008) Constrained clustering: advances in algorithms, theory, and applications. CRC Press. Cited by: §1.
  • M. Bilenko, S. Basu, and R. Mooney (2004) Integrating constraints and metric learning in semi-supervised clustering. In International Conference on Machine Learning (ICML), pp. 11. Cited by: §1, §2.
  • J. Bromley, I. Guyon, Y. LeCun, Säckinger, and R. Shah (2005) Signature verification using a “Siamese" time delay neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 737–744. Cited by: §1, §2.
  • D. Calandriello, G. Niu, and M. Sugiyama (2014) Semi-supervised information-maximization clustering. Neural Networks 57, pp. 103–111. Cited by: §2.
  • S. Chang, C. Aggarwal, and T. Huang (2014) Learning local semantic distances with limited supervision. In IEEE International Conference on Data Mining (ICDM), pp. 70–79. Cited by: §2.
  • D. Cheng, V. Murino, and M. Figueiredo (2007) Clustering under prior knowledge with application to image segmentation. In Advances in Neural Information Processing Systems (NIPS), pp. 401–408. Cited by: §1, §1, §2.
  • J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon (2007) Information-theoretic metric learning. In International Conference on Machine Learning (ICML), pp. 209–216. Cited by: §2.
  • S. Fogel, H. Averbuch-Elor, D. Cohen-Or, and J. Goldberger (2019) Clustering-driven deep embedding with pairwise constraints. IEEE Computer Graphics and Applications 39 (4), pp. 16–27. Cited by: §2.
  • P. Frey and D. Slate (1991) Letter recognition using Holland-style adaptive classifiers. Machine Learning 6 (2), pp. 161–182. Cited by: item 4.
  • S. Furuichi (2006) Information theoretical properties of Tsallis entropies. Journal of Mathematical Physics 47 (2), pp. 023302. Cited by: §3.4.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 1735–1742. Cited by: §2.
  • Y. Hsu and Z. Kira (2015) Neural network-based clustering using pairwise constraints. arXiv:1511.06321. Cited by: §2, §4.4.
  • S. Kaski, J. Sinkkonen, and A. Klami (2005) Discriminative clustering. Neurocomputing 69 (1–3), pp. 18–41. Cited by: §2.
  • J. Kawale and D. Boley (2013) Constrained spectral clustering using l1 regularization. In SIAM International Conference on Data Mining (SDM), pp. 103–111. Cited by: §2.
  • G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Cited by: §1, §2, §3.2, footnote 1.
  • A. Krause, P. Perona, and R. Gomes (2010) Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems (NIPS), pp. 775–783. Cited by: §2.
  • M. Law, A. Topchy, and A. Jain (2005) Model-based clustering with probabilistic constraints. In SIAM Conference on Data Mining (SDM), pp. 641–645. Cited by: §1, §1.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: item 1.
  • D. Lewis, Y. Yang, T. Rose, and F. Li (2004) RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5, pp. 361–397. Cited by: item 3.
  • H. Liu and Y. Fu (2015) Clustering with partition level side information. In 2015 IEEE International Conference on Data Mining, pp. 877–882. Cited by: §1.
  • M. Lu, X. Zhao, L. Zhang, and F. Li (2016) Semi-supervised concept factorization for document clustering. Information Sciences 331, pp. 86–98. Cited by: §2.
  • Z. Lu and T. Leen (2004) Semi-supervised learning with penalized probabilistic clustering. In Advances in Neural Information Processing Systems (NIPS), pp. 849–856. Cited by: §1, §1, §2.
  • V. Melnykov, I. Melnykov, and S. Michael (2016) Semi-supervised model-based clustering with positive and negative constraints. Advances in Data Analysis and Classification 10 (3), pp. 327–349. Cited by: §1.
  • V. Noroozi, L. Zheng, S. Bahaadini, S. Xie, and P. Yu (2017) Seven: deep semi-supervised verification networks. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 2571–2577. Cited by: §3.2, §4.4, §4.4.
  • Y. Pei, X. Fern, T. Tjahja, and R. Rosales (2016) Comparing clustering with pairwise and relative constraints: a unified framework. ACM Transactions on Knowledge Discovery from Data (TKDD) 11 (2). Cited by: §1, §2, §2, item 2, footnote 3.
  • P. Qian, Y. Jiang, S. Wang, K. Su, J. Wang, L. Hu, and R. Muzic (2017) Affinity and penalty jointly constrained spectral clustering with all-compatibility, flexibility, and robustness. IEEE Transactions on Neural Networks and Learning Systems 28 (5), pp. 1123–1138. Cited by: §1, §2.
  • A. Shukla, G. Cheema, and S. Anand (2018) Semi-supervised clustering with neural networks. arXiv:1806.01547. Cited by: §2.
  • M. Śmieja, O. Myronov, and J. Tabor (2018) Semi-supervised discriminative clustering with graph regularization. Knowledge-Based Systems 151, pp. 24–36. Cited by: §1, §2, §2, item 1, footnote 3.
  • M. Śmieja and M. Wiercioch (2017) Constrained clustering with a complex cluster structure. Advances in Data Analysis and Classification 11 (3), pp. 493–518. Cited by: §2.
  • A. Strehl and J. Ghosh (2002) Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, pp. 583–617. Cited by: §4.1.
  • H. Wang, R. Nie, X. Liu, and T. Li (2012) Constraint projections for semi-supervised affinity propagation. Knowledge-Based Systems 36, pp. 315–321. Cited by: §2.
  • Z. Wang and I. Davidson (2010) Flexible constrained spectral clustering. In Proc. ACM Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, pp. 563–572. Cited by: §2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747. Cited by: item 2.
  • J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (ICML), pp. 478–487. Cited by: §2.
  • E. Xing, M. Jordan, S. Russell, and A. Ng (2003) Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), pp. 521–528. Cited by: §2.
  • X. Yin, S. Chen, E. Hu, and D. Zhang (2010) Semi-supervised clustering with metric learning: an adaptive kernel method. Pattern Recognition 43 (4), pp. 1320–1333. Cited by: §2.
  • H. Zhang, S. Basu, and I. Davidson (2019) Deep constrained clustering - algorithms and advances. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pp. 17. Cited by: §2, item 3.