1 Introduction
Clustering is an important unsupervised learning tool, often used to analyze the structure of complex high-dimensional data. Without any additional information about the underlying class/cluster structure, clustering results may contradict our prior knowledge or assumptions about the data being analyzed.
Semi-supervised clustering (SSC) methods tackle this issue by leveraging partial prior information about class labels, with the goal of obtaining partitions that are better aligned with the true classes Basu et al. (2004, 2008); Liu and Fu (2015); Cheng et al. (2007); Law et al. (2005); Lu and Leen (2004). One typical way of injecting class label information into clustering is in the form of pairwise constraints (typically, must-link and cannot-link constraints) or pairwise preferences (e.g., should-link and shouldn't-link), which indicate whether a given pair of points is believed to belong to the same or different classes. Most SSC approaches rely on adapting existing unsupervised clustering methods to handle partial (namely, pairwise) information Melnykov et al. (2016); Bilenko et al. (2004); Cheng et al. (2007); Law et al. (2005); Lu and Leen (2004); Qian et al. (2017). This requires transferring class-label knowledge into a clustering algorithm, which is often unnatural and puts a higher weight on clustering structure than on class labels. It has recently been shown that discriminative clustering methods, which approach clustering problems using classification tools, are usually more effective in taking advantage of label constraints/information Pei et al. (2016); Śmieja et al. (2018). While those formulations assume that class labels are the primary source of semi-supervision, it might be difficult to produce satisfactory results in the presence of a large number of clusters; in that case, a small number of pairwise constraints may not suffice to determine the correct cluster assignments.
In this paper, we go one step further than other discriminative approaches and decouple SSC into two stages:
Stage 1: predict pairwise relations between pairs of unlabeled points (a binary classification problem), which allows assigning predicted labels to unlabeled pairs, thus increasing the number of labeled pairs;

Stage 2: use the labeled pairs (both given and predicted) in a semi-supervised clustering method.
The rationale behind our approach follows from the observation that it is easier to learn a binary classifier than to solve a multi-class problem under partial supervision, especially when the number of classes (clusters) is high. To increase the flexibility of our framework, we instantiate it with two neural networks, specifically with so-called Siamese neural networks Bromley et al. (2005); Koch et al. (2015). The first network (LabNet, the labeling network) is used to classify pairs of examples as must-link or cannot-link, while the second network (CluNet, the clustering network) is trained on labeled pairs to predict the final cluster assignments (see Figure 1). We term our method SC: Semi-Supervised Siamese Classifiers for Clustering.
In the experiments reported below, we implement SC with general-purpose dense deep neural networks (DNNs), as well as with convolutional neural networks (CNNs) to handle images. In both cases, SC outperforms other neural-network-based SSC techniques. Additionally, we experimentally and theoretically analyze the impact of the networks' parameterization on the clustering results. Our contributions are summarized as follows:

a classification-based method for SSC with pairwise constraints, which first labels pairs of data points and then uses these predicted labels to perform SSC;

an implementation of the proposed method with two Siamese DNNs, which allows controlling the flexibility of the clustering model by adjusting the numbers of layers and neurons; the corresponding parameterization is studied theoretically and experimentally;

experimental results showing the superiority of the proposed SC method over related approaches on several datasets, including Letters, which has 26 classes; to the best of our knowledge, SSC had not previously been tested on such a large number of classes.
The code of our method will be made publicly available online after acceptance of the paper.
2 Related work
The most common way of using pairwise constraints in SSC relies on modifying the underlying cost function of a classical unsupervised clustering model Śmieja and Wiercioch (2017); Qian et al. (2017); Lu et al. (2016). Such an approach was used in k-means, via a term penalizing pairwise constraint violation Bilenko et al. (2004), and in Gaussian mixture models (GMM), with hidden Markov random fields modelling pairwise relations Basu et al. (2004); Cheng et al. (2007); Lu and Leen (2004). In spectral clustering, the underlying eigenvalue problem was modified by adding the pairwise constraints to the corresponding objective function Kawale and Boley (2013); Wang and Davidson (2010). Another line of work focuses on modifying the similarity measure based on the pairwise relations Asafi and Cohen-Or (2013); Chang et al. (2014); Wang et al. (2012), by learning optimal Mahalanobis distances Davis et al. (2007); Xing et al. (2003), or more general kernel functions Yin et al. (2010).

Recently, it has been shown that discriminative clustering formulations Kaski et al. (2005) are often more effective in leveraging pairwise relations than the aforementioned methods. The authors of Pei et al. (2016) used an analogue of the classification log-loss function based on pairwise constraints and added entropy regularization Krause et al. (2010) to prevent degenerate solutions. In a similar spirit, Śmieja et al. (2018) maximized the expected number of correctly classified pairs, based on pairwise constraints and an underlying distance function. The authors of Calandriello et al. (2014) used a squared-loss mutual information term to regularize a discriminative clustering model.

Although DNNs are dominant in many areas of machine learning, they have rarely been used for SSC. The authors of Hsu and Kira (2015) used a KL-divergence-based loss to train a DNN to predict cluster distributions from pairwise relations; one limitation of that method is its inability to use unlabeled data. Other works Fogel et al. (2019); Shukla et al. (2018); Zhang et al. (2019) used autoencoders with reconstruction losses to exploit the inner characteristics of unlabeled data. In Shukla et al. (2018), the k-means loss is combined with a KL-divergence to create compact clusters preserving pairwise relations. In Fogel et al. (2019), the distance between must-link/cannot-link pairs was minimized/maximized, instead of using the KL-divergence. Deep embedded clustering (DEC) jointly learns feature representations and cluster assignments using deep neural networks Xie et al. (2016). Finally, a method capable of using various types of side information was proposed in Zhang et al. (2019).

Our work extends recent discriminative SSC methods Śmieja et al. (2018); Pei et al. (2016) by learning additional pairwise relations. Moreover, the approach is implemented using Siamese neural networks Bromley et al. (2005); Hadsell et al. (2006); Koch et al. (2015), allowing for higher flexibility. In contrast to the aforementioned deep SSC methods, our model is fully discriminative and uses the misclassification error as its only loss term.
3 Proposed Method
3.1 Formulation
Let X = {x_1, …, x_n} be a dataset, where every instance x_i belongs to one of K classes. The goal is to split X into K clusters, which are compatible with the true (unknown) classes.
We assume that partial class information is given in the form of pairwise constraints, indicating whether two examples belong to the same (must-link constraint) or different (cannot-link constraint) classes. Formally, the class information is expressed via a set of labeled pairs M ∪ C, where M is the set of must-link pairs (x_i, x_j) known to belong to the same class, and C is the set of cannot-link pairs known to belong to different classes.

To make the notation lighter in what follows, we assume that M always contains all the pairs of the form (x, x), because the binary relation “belong to the same class" is obviously reflexive. Furthermore, because the binary relations “belong to the same class" and “belong to different classes" are both symmetric, we also assume that (x_i, x_j) ∈ M implies (x_j, x_i) ∈ M, and similarly for C.

Finally, let U = (X × X) \ (M ∪ C) denote the set of unlabeled pairs.
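The assumptions above (symmetry of both relations, plus the reflexive must-link pairs) can be made concrete in a few lines of code. The sketch below is ours, not from the paper; integer indices stand for data points:

```python
def close_constraints(must, cannot, n):
    """Symmetric closure of the given constraint sets over n points,
    plus the reflexive must-link pairs (i, i) assumed in the text.
    Returns the must-link set M, cannot-link set C and unlabeled set U."""
    M = set(must) | {(j, i) for (i, j) in must} | {(i, i) for i in range(n)}
    C = set(cannot) | {(j, i) for (i, j) in cannot}
    U = {(i, j) for i in range(n) for j in range(n)} - M - C
    return M, C, U
```

With one must-link (0, 1) and one cannot-link (0, 2) over three points, the closure adds the mirrored and reflexive pairs, and U collects the remaining unlabeled pairs.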
The proposed SC model is composed of two classification neural networks. The labeling network (LabNet) is trained to assign labels (must-link or cannot-link) to pairs of examples not in M ∪ C. The clustering network (CluNet) is trained on labeled pairs to predict cluster assignments. The proposed scheme is illustrated in Figure 1 and described in detail next.
3.2 The Labeling Network: LabNet
Instead of doing SSC directly, which can be a difficult multi-class problem, we first tackle a simpler binary classification problem: learning to label new pairs (i.e., pairs in U) as belonging to the same class (must-link) or to different classes (cannot-link). By classifying pairs in U, we obtain new must-link/cannot-link labels that will be used by the CluNet to predict the final cluster assignments (as described in the next subsection).
We address this classification problem using a pair of Siamese neural networks (identical networks, i.e., with shared weights) Koch et al. (2015). The task of these networks is to take a pair of points (x_i, x_j) and return their representations (N(x_i), N(x_j)), based on which it will be decided if x_i and x_j are in the same or different classes. Naturally, they are trained to make N(x_i) close to N(x_j), if (x_i, x_j) ∈ M, and distant from N(x_j), if (x_i, x_j) ∈ C. To this end, we use a contrastive loss based on the Euclidean distance d(x_i, x_j) = ‖N(x_i) − N(x_j)‖, defined as:

    ℓ(x_i, x_j) = d(x_i, x_j)²,               if (x_i, x_j) ∈ M,
    ℓ(x_i, x_j) = max{0, 1 − d(x_i, x_j)}²,   if (x_i, x_j) ∈ C.    (1)
Notice that the presence of pairs of the form (x, x) in M does not contribute to the loss, because d(x, x) = 0. Clearly, d being a distance, d(x_i, x_j) ≥ 0, for all x_i, x_j. Observe that a cannot-link pair contributes to the loss only if its distance is below 1 (see Figure 2). A crucial aspect is that the LabNet does not decide whether two points belong to the same or different classes; it only yields similarity scores for pairs of data points¹. A hard link prediction is obtained by comparing the distance with a threshold τ: x_i and x_j are classified as being in the same class if and only if d(x_i, x_j) < τ. One natural choice is τ = 1/2, the midpoint between the must-link target distance (0) and the cannot-link margin (1). Below, we will explain that a smaller τ is usually a better choice in our case.

¹ Siamese networks have been used for one-shot learning Koch et al. (2015), where the class of a given example is decided by comparing the output of one of the twin networks with that of the other on a set of examples of known classes.
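As an illustration, the contrastive loss in (1) and the thresholded hard link prediction can be sketched in numpy. This is a minimal sketch assuming the unit margin described above; the function names are ours:

```python
import numpy as np

def contrastive_loss(z_i, z_j, must_link, margin=1.0):
    """Contrastive loss for one pair of embeddings: must-link pairs are
    pulled together (squared distance), cannot-link pairs are penalized
    only when their distance falls below the margin."""
    d = np.linalg.norm(z_i - z_j)
    if must_link:
        return d ** 2
    return max(0.0, margin - d) ** 2

def predict_link(z_i, z_j, tau):
    """Hard link prediction: same class iff the embedding distance < tau."""
    return np.linalg.norm(z_i - z_j) < tau
```

With embeddings at distance 2, a cannot-link pair incurs no loss (it is beyond the margin), while a coincident cannot-link pair incurs the full squared-margin penalty.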
In the training phase of the LabNet, only pairwise constraints are used (the loss in (1) only depends on M ∪ C). To leverage the information contained in unlabeled data, we consider an adaptation of SEVEN (SEmi-supervised VErification Network, Noroozi et al. (2017)), yielding a semi-supervised version of the LabNet. The idea is to encourage the mapping N to learn a salient structure shared by all categories. For this purpose, each Siamese twin is supplied with a decoder network D, which aims at obtaining a reconstruction of x from its latent representation N(x): x ≈ D(N(x)). This goal is pursued by using a reconstruction error loss term,

    L_rec = Σ_x ‖x − D(N(x))‖².
Finally, the total loss used for training the semi-supervised LabNet is

    L_lab = Σ_{(x_i, x_j) ∈ M ∪ C} ℓ(x_i, x_j) + α L_rec,    (2)

where α is a trade-off parameter.
Once trained, the LabNet is applied to yield pairwise constraints for all pairs of data points. Let M̂ and Ĉ denote the sets of pairs in U that the LabNet labels as must-link and cannot-link, respectively; clearly, M̂ ∪ Ĉ = U.
3.3 The Clustering Network: CluNet
Since the application of the LabNet to the unlabeled pairs yields pairwise constraints for all the pairs in the dataset, the final clustering can be obtained in a purely supervised manner. Instead of a typical unsupervised clustering method (e.g., k-means or GMM), we thus employ a discriminative framework, which is more effective in the supervised case. Namely, we directly model cluster assignments with posterior probabilities p(k|x), for k = 1, …, K. From these posterior probability estimates, X may be partitioned by assigning every point to the cluster that maximizes p(k|x).

To provide sufficient flexibility, we instantiate the CluNet as a Siamese pair of identical DNNs, where each pair of points is processed by two identical (Siamese twin) subnetworks with shared weights. Equipped with softmax output layers, these Siamese twin networks yield class posterior probabilities p(·|x_i) and p(·|x_j), for each pair of items (x_i, x_j).
To form clusters consistent with the pairwise constraints, we aim at minimizing the number of misclassified pairs. Note that, given the posterior class probabilities p(·|x_i) and p(·|x_j), for a pair of points (x_i, x_j), the probability that x_i and x_j are in the same cluster is given by

    p_same(x_i, x_j) = Σ_{k=1}^{K} p(k|x_i) p(k|x_j),

whereas 1 − p_same(x_i, x_j) is the probability that they are in different clusters. We thus define the misclassification loss with respect to the must-link and cannot-link information as

    L_clu = Σ_{(x_i, x_j) ∈ M ∪ M̂} (1 − p_same(x_i, x_j)) + Σ_{(x_i, x_j) ∈ C ∪ Ĉ} p_same(x_i, x_j).    (3)
The structure of the CluNet is shown in Figure 1(b). Whereas during the training phase the loss function in (3) uses the Siamese pair, since it applies to pairs of points, in the testing phase only one of the networks is needed (as indicated in Figure 1(b)) to produce cluster assignments: a given point x is assigned to the cluster with the highest posterior probability, argmax_k p(k|x).
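The CluNet loss and the test-time assignment rule can be sketched in a few lines of numpy: p_same is simply the dot product of the two posterior vectors. The names below are illustrative, not from the paper:

```python
import numpy as np

def same_cluster_prob(p_i, p_j):
    """P(x_i and x_j land in the same cluster), given class posteriors."""
    return float(np.dot(p_i, p_j))

def pair_loss(P, must_pairs, cannot_pairs):
    """Sketch of the loss in (3): expected number of misclassified pairs,
    given an (n, K) array P of class posteriors and index pairs."""
    loss = sum(1.0 - same_cluster_prob(P[i], P[j]) for i, j in must_pairs)
    loss += sum(same_cluster_prob(P[i], P[j]) for i, j in cannot_pairs)
    return loss

def assign_clusters(P):
    """Test-time rule: each point goes to its most probable cluster."""
    return P.argmax(axis=1)
```

With confident posteriors, a correctly grouped must-link pair contributes zero loss, while a must-link pair split across clusters contributes one misclassified pair.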
3.4 Adjusting the LabNet Classification Threshold
We analyze the influence of the LabNet threshold τ on the CluNet results. Let us begin by assuming that τ = 0; in this case, all pairs in U are labeled by the LabNet as cannot-link (i.e., M̂ = ∅ and Ĉ = U) and the loss in (3) can be written as

    L_clu = Σ_{(x_i, x_j) ∈ M} (1 − p_same(x_i, x_j)) + Σ_{(x_i, x_j) ∈ C ∪ U} p_same(x_i, x_j).

Assuming |M ∪ C| ≪ |U| (as will be the case in all the experiments below and is the typical scenario in SSC),

    L_clu ≈ Σ_{i,j} p_same(x_i, x_j) = Σ_{k=1}^{K} (Σ_{i=1}^{n} p(k|x_i))² = n² Σ_{k=1}^{K} p̂_k²,    (4)

where

    p̂_k = (1/n) Σ_{i=1}^{n} p(k|x_i)

is an estimate of the probability of the k-th class approximated by the clustering model (notice that Σ_k p̂_k = 1). The last term in (4) is related to the index-2 Tsallis entropy Furuichi (2006),

    S_2(p̂) = 1 − Σ_{k=1}^{K} p̂_k².

Since S_2 is maximized by the uniform distribution, L_clu is (approximately) minimized by taking equally-sized clusters. This means that predicting a large number of cannot-link pairs (by setting τ = 0) encourages high-entropy (approximately uniform) clusterings and discourages degenerate solutions.

By increasing the threshold τ, more pairs are classified as must-link and fewer as cannot-link. In this case,

    L_clu = Σ_{(x_i, x_j) ∈ M ∪ M̂} (1 − p_same(x_i, x_j)) + Σ_{(x_i, x_j) ∈ C ∪ Ĉ} p_same(x_i, x_j),

where (above a certain value of τ) M̂ ≠ ∅. This can be rewritten as

    L_clu = Σ_{i,j} p_same(x_i, x_j) − 2 Σ_{(x_i, x_j) ∈ M ∪ M̂} p_same(x_i, x_j) + |M ∪ M̂|,

where the last term is a constant that depends only on the output of the LabNet. This form of the loss function shows that: (a) it encourages pairs in M ∪ M̂ to be given a high probability of being classified in the same class (large p_same); (b) it encourages the Tsallis entropy of the estimated class probabilities to be high (low Σ_k p̂_k²). In other words, must-link constraints (those in M ∪ M̂) play a more active role in this loss function, whereas cannot-link pairs (those in C ∪ Ĉ) essentially only contribute to the entropic term of the loss.
The observation in the previous paragraph shows that obtaining must-link constraints is crucial for the performance of the CluNet. This is, however, a double-edged sword: correct must-links provide valuable information to train the CluNet, but erroneous ones may be very harmful. If two instances from different classes are wrongly put in M̂, this directly impacts the middle term of the loss, whereas two examples from the same class that are wrongly put in Ĉ essentially only affect the regularization (first) term of the loss, in addition to being missing from M̂. Furthermore, an erroneous must-link constraint can be implicitly propagated to other pairs, due to the transitivity of the binary relation “belong to the same class", whereas the binary relation “belong to different classes" is not transitive.
The above considerations suggest that it is safer to use small values of the threshold τ. This is especially important if the number of given pairwise constraints is small, because the accuracy of the LabNet may then be low. In this case, the LabNet with a small τ puts in M̂ only pairs about which it is very confident. The other pairs will contribute to the entropic regularization term. If the number of given constraints is larger, we can use a higher threshold and label more pairs as must-link with higher confidence. Consequently, the natural choice τ = 1/2 may be optimal only in the presence of large sets of constraints, which is seldom the case in practice. Experimental validation of this rationale is presented in Section 4.3.
4 Experiments
In this section, we evaluate our SC approach against state-of-the-art methods and investigate the effect of the parameterization of the LabNet on the clustering results.
4.1 Experimental setting
We consider four popular datasets with normalized attributes:

MNIST: contains 70k grayscale images of handwritten digits, of size 28×28 (60k for training and 10k for testing) LeCun et al. (1998). The set is divided into 10 equally-sized classes.

Fashion-MNIST: a dataset of 70k grayscale images of size 28×28 (60k for training and 10k for testing), with 10 classes Xiao et al. (2017). The images show clothing items.

Reuters: contains English news stories labeled with a category tree Lewis et al. (2004). Analogously to previous uses of this data in clustering, we randomly sampled a subset of 12k examples (10k for training and 2k for testing) from 4 root classes: corporate/industrial, government/social, markets, and economics. Documents were represented using TF-IDF features on the 2000 most frequent words.

Letters: contains descriptions of the capital letters of the English alphabet (26 classes) Frey and Slate (1991). The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce 20k examples. Each example was converted into 16 primitive numerical attributes (statistical moments and edge counts). We used the first 15k examples for training and the remaining 5k for testing.
To generate pairwise constraints, we randomly select pairs of instances and label them either as must-link or cannot-link constraints (depending on their true relation). The numbers of must-link and cannot-link constraints are kept equal. The results are evaluated using normalized mutual information (NMI, Strehl and Ghosh (2002)), which attains a maximal value of 1 for two identical partitions. To reduce the effect of randomness, we generate 5 different sets of pairwise constraints for each number of constraints considered; the final score is the NMI averaged over these 5 sets.
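The constraint-generation protocol described above can be sketched as follows; this is an illustrative implementation of the stated protocol, not the authors' code:

```python
import random

def sample_constraints(labels, n_constraints, seed=0):
    """Randomly sample pairs of instances and label them by their true
    relation, keeping equal numbers of must-link (ml) and cannot-link (cl)
    constraints, as described in the experimental setting."""
    rng = random.Random(seed)
    ml, cl, n = [], [], len(labels)
    half = n_constraints // 2
    while len(ml) < half or len(cl) < half:
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        if labels[i] == labels[j] and len(ml) < half:
            ml.append((i, j))
        elif labels[i] != labels[j] and len(cl) < half:
            cl.append((i, j))
    return ml, cl
```

Repeating this with different seeds reproduces the "5 different sets of constraints per level" protocol.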
4.2 Comparison with related models
We first compare the performance of SC with other SSC approaches for various levels of pairwise constraints. We restrict our attention to DNN-based methods.
As explained in Section 3.3, each of the networks in the Siamese pair in the CluNet is equipped with a softmax output layer. To make our model domain-agnostic, rather than specialized to a specific dataset or domain, we use two dense hidden layers with 256 neurons each and the ReLU activation function, as well as dropout after each hidden layer (with rate 0.1, except for the Reuters dataset, where we use a dropout rate of 0.5). Each batch consists of 100 training pairwise constraints and 1000 unlabeled pairs labeled by the LabNet. The learning rate is set to a fixed value. The LabNet has an analogous structure: each DNN in Figure 1(a), which corresponds to the mapping N introduced in Section 3.2, has 2 hidden dense layers with 256 neurons each, ReLU activation functions, and dropout with the same rates, and an output dense layer also with 256 neurons, but with a sigmoid activation function (i.e., with outputs in [0, 1]). The threshold τ is set to a small fixed value; a more detailed study of the selection of τ is presented below. We use a batch size of 256 examples and a fixed learning rate. In this subsection, we restrict our attention to the fully supervised version of the LabNet.

For comparison, we select three recent SSC methods:

d-graph: a DNN-based implementation of d-graph Śmieja et al. (2018). The network architecture is identical to the CluNet (the batch structure is also the same). The closest unlabeled pairs in each batch are labeled as auxiliary must-link constraints, while the remaining pairs are considered as cannot-link².

² We also tried different numbers of neighbors, but the results were worse.

DCPR: a DNN-based implementation of DCPR Pei et al. (2016) (the architecture and batch structure are the same as in d-graph). The entropy and conditional entropy used to regularize the clustering model are estimated from each batch.

IDEC: an SSC method proposed in Zhang et al. (2019) (http://github.com/blueocean92), using pairwise constraints. The network structure and training procedure follow the authors' code.
The results presented in Figure 3 show the good performance of SC, especially with the larger numbers of constraints. For the smaller numbers of constraints (100 and 200 links), the LabNet is not able to accurately predict links, which negatively influences the performance of the CluNet. In this case, SC is inferior to d-graph, but it is still competitive with or better than DCPR and IDEC. It is worth emphasizing the extremely good results on the Letters dataset, which is composed of 26 classes. To the best of our knowledge, a dataset with so many classes had not been used before for SSC with pairwise constraints³.

³ A subset of the Letters dataset with only 5 classes was used in Śmieja et al. (2018); Pei et al. (2016).
The results in Figure 3 also show that d-graph performs best with the smaller numbers of pairwise constraints. For higher numbers of constraints, it is outperformed by SC and IDEC. This is arguably due to the fact that d-graph generates auxiliary labeled pairs based only on distances. Moreover, a single batch may be too small to find a good kNN graph. DCPR is competitive with SC only on the Reuters dataset; on the other datasets, its performance is worse. IDEC gives good results for large numbers of constraints, but its performance is not stable (its results do not always improve as the number of constraints grows on MNIST and Letters).
4.3 Study of the Labeling Network
As discussed in Section 3.4, the choice of the labeling threshold τ may be crucial for the performance of SC. Since it may be difficult to find an optimal value using cross-validation when only a small number of labeled pairs is available, we experimentally analyze various threshold values to get better insight into our model.
The results presented in Figure 4 are consistent with the reasoning presented in Section 3.4. For small numbers of given constraints, the LabNet is unable to correctly predict pairwise relations. It is thus better to use a low threshold and assign must-link labels only to the most confident pairs, because erroneous must-link constraints negatively affect the clustering results and, as argued in Section 3.4, cannot-link constraints have essentially a regularization effect. For larger numbers of labeled pairs, a higher threshold can be used, due to the better accuracy of the LabNet. Nevertheless, it is difficult to define a general rule for threshold selection; it can be seen, however, that a small threshold is a safe choice, leading to good results for all datasets at all levels of semi-supervision.
To get further insight into our model, we compute the correlation coefficients between the clustering NMI and classification statistics gathered from the LabNet. Namely, we consider: (a) the accuracy, i.e., the fraction of pairs whose link type is correctly predicted; (b) the must-link (ML) rate, i.e., the fraction of true must-link pairs correctly classified as must-link; (c) the cannot-link (CL) rate, i.e., the fraction of true cannot-link pairs correctly classified as cannot-link.
While the accuracy measures the overall performance of the LabNet classifier, the ML and CL rates assess how the model predicts examples from the underlying classes. Figure 5 shows that, for small and medium numbers of constraints, the CL rate has the highest correlation with the clustering performance as measured by NMI. It is also interesting to observe that, in most cases, the ML rate has a negative correlation with NMI, which partially confirms our intuition that labeling cannot-link pairs as must-link has a negative effect on the final performance. On the other hand, assigning cannot-link labels to must-link pairs does not have a negative influence, because it simply leads to stronger regularization: such a labeling does not improve the performance of the clustering model, but neither does it deteriorate it. For the highest numbers of constraints (2k and 5k), the correlation with the CL rate is not as strong (it is negative for Fashion-MNIST). We verified that, in those cases, the CL rates were higher than 95% for most models; consequently, the clustering results could only be improved by increasing the ML rate. It is evident that the accuracy of the LabNet cannot be used as the only indicator of final success. Clearly, higher accuracy allows obtaining better clustering results, but the ML and CL rates give more detailed information. In particular, it is important to use a labeling network with a high CL rate, and only then should one care about the ML rate.
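For concreteness, the three statistics can be computed as follows; this is a sketch under the definitions above, with 1 encoding must-link and 0 cannot-link:

```python
def link_stats(true_links, pred_links):
    """Accuracy, ML rate and CL rate of a link classifier.
    The ML (CL) rate is the fraction of true must-link (cannot-link)
    pairs that the classifier predicts correctly."""
    pairs = list(zip(true_links, pred_links))
    acc = sum(t == p for t, p in pairs) / len(pairs)
    ml = [p for t, p in pairs if t == 1]          # predictions on true ML pairs
    cl = [p for t, p in pairs if t == 0]          # predictions on true CL pairs
    ml_rate = sum(ml) / len(ml)
    cl_rate = (len(cl) - sum(cl)) / len(cl)
    return acc, ml_rate, cl_rate
```

For example, a classifier that misses one of two must-link pairs but catches both cannot-link pairs has accuracy 0.75, ML rate 0.5, and CL rate 1.0.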
4.4 Model specialized to image processing
In the previous experiments, we used dense neural networks, which can be applied to generic (not too high-dimensional) datasets, regardless of their domain. We now show that the performance of our method can be further increased by selecting a network architecture specialized to a given task. In particular, we present its specialization to image data, using the MNIST and Fashion-MNIST datasets. In addition, we also consider the semi-supervised version of the LabNet Noroozi et al. (2017), which is trained on unlabeled pairs as well.
The CluNet is instantiated using two convolutional layers (32 filters each), with max pooling and dropout after each one. This is followed by two dense layers (with 128 and 10 neurons, respectively) and dropout between them. The architecture of the LabNet is composed of identical convolutional layers with max pooling and dropout, followed by a single dense layer with 128 neurons. In the case of the semi-supervised LabNet, every Siamese twin is supplied with a decoder network, which is implemented using symmetric deconvolution layers and upsampling. The trade-off parameter α in (2) is set based on the results presented in Noroozi et al. (2017). The other models, d-graph, DCPR, and IDEC, are implemented using analogous architectures. Additionally, we use NN-clustering Hsu and Kira (2015). In contrast to the other methods considered here, NN-clustering is trained only on the set of pairwise constraints (no unlabeled pairs are used); we use the authors' code, where the method is implemented using convolutional LeNet networks.

The results presented in Figure 6 demonstrate that the specialized convolutional architecture yields better clustering results than dense networks (see Figure 3 for a comparison). Moreover, the use of unlabeled data in the LabNet has a positive influence on the final results. Our clustering method with the semi-supervised LabNet noticeably outperforms its variant with the supervised LabNet when a small number of constraints is available. For larger numbers of constraints, the difference is smaller, because the network has enough data to be trained. As before, both variants of our method are better than d-graph, DCPR, and NN-clustering. IDEC is also inferior to our method, except in the case of 5000 constraints on MNIST, where it obtains the highest performance.
In addition to the clustering NMI, we also assessed the accuracy, as well as the ML and CL rates, for both versions of the LabNet. The differences between these quantities for the semi-supervised and supervised LabNet are shown in Figure 7. The results demonstrate that the semi-supervised LabNet yields higher CL rates when smaller numbers of constraints are given. This again confirms that our clustering method is very sensitive to erroneous ML constraints, and that a good labeling network should correctly predict most cannot-link pairs.
5 Conclusion
In this paper, we introduced a classification-based approach to semi-supervised clustering with pairwise constraints. It was shown that decomposing a semi-supervised clustering task into two simpler problems, classifying pairwise relations and then performing supervised clustering, is a better option than directly solving the original task. Our framework is implemented using two Siamese neural networks and is experimentally shown to achieve state-of-the-art performance on several benchmark datasets.
In the future, we plan to investigate different approaches to classifying pairwise relations. On the one hand, it is beneficial to construct a model that guarantees a high CL rate. On the other hand, one could also design active learning mechanisms, which query must-link pairs with high probability, in order to strengthen the clustering model.
Acknowledgement
This work was carried out while M. Śmieja was a Post-Doctoral Scholar at Instituto Superior Técnico, University of Lisbon. The work was partially supported by the National Science Centre (Poland), grant no. 2016/21/D/ST6/00980.
References

Constraints as features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1634–1641. Cited by: §2.
 A probabilistic framework for semi-supervised clustering. In ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 59–68. Cited by: §1, §2.
 Constrained clustering: advances in algorithms, theory, and applications. CRC Press. Cited by: §1.
 Integrating constraints and metric learning in semi-supervised clustering. In International Conference on Machine Learning (ICML), pp. 11. Cited by: §1, §2.
 Signature verification using a “Siamese" time delay neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 737–744. Cited by: §1, §2.
 Semi-supervised information-maximization clustering. Neural Networks 57, pp. 103–111. Cited by: §2.
 Learning local semantic distances with limited supervision. In IEEE International Conference on Data Mining (ICDM), pp. 70–79. Cited by: §2.
 Clustering under prior knowledge with application to image segmentation. In Advances in Neural Information Processing Systems (NIPS), pp. 401–408. Cited by: §1, §1, §2.
 Information-theoretic metric learning. In International Conference on Machine Learning (ICML), pp. 209–216. Cited by: §2.
 Clustering-driven deep embedding with pairwise constraints. IEEE Computer Graphics and Applications 39 (4), pp. 16–27. Cited by: §2.
 Letter recognition using Holland-style adaptive classifiers. Machine Learning 6 (2), pp. 161–182. Cited by: item 4.
 Information theoretical properties of Tsallis entropies. Journal of Mathematical Physics 47 (2), pp. 023302. Cited by: §3.4.
 Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 1735–1742. Cited by: §2.
 Neural network-based clustering using pairwise constraints. arXiv:1511.06321. Cited by: §2, §4.4.
 Discriminative clustering. Neurocomputing 69 (1–3), pp. 18–41. Cited by: §2.
 Constrained spectral clustering using l1 regularization. In SIAM International Conference on Data Mining (SDM), pp. 103–111. Cited by: §2.
 Siamese neural networks for oneshot image recognition. In ICML Deep Learning Workshop, Cited by: §1, §2, §3.2, footnote 1.
 Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems (NIPS), pp. 775–783. Cited by: §2.
 Model-based clustering with probabilistic constraints. In SIAM Conference on Data Mining (SDM), pp. 641–645. Cited by: §1, §1.
 Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: item 1.
 RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5, pp. 361–397. Cited by: item 3.
 Clustering with partition level side information. In 2015 IEEE International Conference on Data Mining, pp. 877–882. Cited by: §1.
 Semi-supervised concept factorization for document clustering. Information Sciences 331, pp. 86–98. Cited by: §2.
 Semi-supervised learning with penalized probabilistic clustering. In Advances in Neural Information Processing Systems (NIPS), pp. 849–856. Cited by: §1, §1, §2.
 Semi-supervised model-based clustering with positive and negative constraints. Advances in Data Analysis and Classification 10 (3), pp. 327–349. Cited by: §1.

SEVEN: deep semi-supervised verification networks. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 2571–2577. Cited by: §3.2, §4.4, §4.4.
 Comparing clustering with pairwise and relative constraints: a unified framework. ACM Transactions on Knowledge Discovery from Data (TKDD) 11 (2). Cited by: §1, §2, §2, item 2, footnote 3.
 Affinity and penalty jointly constrained spectral clustering with allcompatibility, flexibility, and robustness. IEEE Transactions on Neural Networks and Learning Systems 28 (5), pp. 1123–1138. Cited by: §1, §2.
 Semi-supervised clustering with neural networks. arXiv:1806.01547. Cited by: §2.
 Semi-supervised discriminative clustering with graph regularization. Knowledge-Based Systems 151, pp. 24–36. Cited by: §1, §2, §2, item 1, footnote 3.
 Constrained clustering with a complex cluster structure. Advances in Data Analysis and Classification 11 (3), pp. 493–518. Cited by: §2.
 Cluster ensembles – a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, pp. 583–617. Cited by: §4.1.
 Constraint projections for semisupervised affinity propagation. KnowledgeBased Systems 36, pp. 315–321. Cited by: §2.
 Flexible constrained spectral clustering. In Proc. ACM Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, pp. 563–572. Cited by: §2.
 Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747. Cited by: item 2.

Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (ICML), pp. 478–487. Cited by: §2.
 Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), pp. 521–528. Cited by: §2.
 Semisupervised clustering with metric learning: an adaptive kernel method. Pattern Recognition 43 (4), pp. 1320–1333. Cited by: §2.
 Deep constrained clustering: algorithms and advances. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pp. 17. Cited by: §2, item 3.