1 Introduction
The objective of clustering is to partition unlabeled data into disjoint groups based on their similarity, and clustering has been extensively studied in statistics and machine learning.
K-means [12] is a classic algorithm that clusters data so that the sum of within-cluster scatters is minimized. However, its usefulness is rather limited in practice because k-means only produces linearly separated clusters.
Kernel k-means [5] overcomes this limitation by performing k-means in a feature space induced by a reproducing kernel function [15]. Spectral clustering [17, 13] first unfolds nonlinear data manifolds based on sample-sample similarity by a spectral embedding method, and then performs k-means in the embedded space. These nonlinear clustering techniques are capable of handling highly complex real-world data. However, they lack objective model selection strategies, i.e., tuning parameters included in kernel functions or similarity measures need to be manually determined in an unsupervised manner. Information-maximization clustering can address the issue of model selection [1, 7, 19]: a probabilistic classifier is learned so that some information measure between feature vectors and cluster assignments is maximized in an unsupervised manner. In the information-maximization approach, tuning parameters included in kernel functions or similarity measures can be systematically determined based on the information-maximization principle. Among the information-maximization clustering methods, the algorithm based on squared-loss mutual information (SMI) was demonstrated to be promising [19], because it gives the clustering solution analytically via eigendecomposition.
In practical situations, additional side information regarding clustering solutions is often provided, typically in the form of must-links and cannot-links: a set of sample pairs that should belong to the same cluster and a set of sample pairs that should belong to different clusters, respectively. Such semi-supervised clustering (also known as clustering with side information) has been shown to be useful in practice [23, 6, 22]. Spectral learning [9] is a semi-supervised extension of spectral clustering that enhances the similarity with side information so that sample pairs tied by must-links have higher similarity and sample pairs tied by cannot-links have lower similarity. On the other hand, constrained spectral clustering [24] incorporates the must-links and cannot-links as constraints in the optimization problem.
However, in the same way as unsupervised clustering, the above semi-supervised clustering methods suffer from a lack of objective model selection strategies, and thus tuning parameters included in similarity measures need to be determined manually. In this paper, we extend the unsupervised SMI-based clustering method to the semi-supervised clustering scenario. The proposed method, called semi-supervised SMI-based clustering (3SMIC), gives the clustering solution analytically via eigendecomposition with a systematic model selection strategy. Through experiments on real-world datasets, we demonstrate the usefulness of the proposed 3SMIC algorithm.
2 Information-Maximization Clustering with Squared-Loss Mutual Information
In this section, we formulate the problem of information-maximization clustering and review an existing unsupervised clustering method based on squared-loss mutual information.
2.1 Information-Maximization Clustering
The goal of unsupervised clustering is to assign class labels to data instances so that similar instances share the same label and dissimilar instances have different labels. Let x_1, …, x_n be feature vectors of data instances, which are drawn independently from a probability distribution with density p(x). Let y_1, …, y_n ∈ {1, …, c} be class labels that we want to obtain, where c denotes the number of classes; we assume c to be known throughout the paper.
The information-maximization approach tries to learn the class-posterior probability p(y|x) in an unsupervised manner so that some "information" measure between feature x and label y is maximized. Mutual information (MI) [16] is a typical information measure for this purpose [1, 7]:

MI := ∫ Σ_{y=1}^c p(x, y) log( p(x, y) / ( p(x) p(y) ) ) dx.   (1)
An advantage of the information-maximization formulation is that tuning parameters included in clustering algorithms, such as the Gaussian width and the regularization parameter, can be objectively optimized based on the same information-maximization principle. However, MI is known to be sensitive to outliers [3], due to the strongly nonlinear log function. Furthermore, unsupervised learning of the class-posterior probability under MI is highly non-convex, and finding a good local optimum is not straightforward in practice [7].
To cope with this problem, an alternative information measure called squared-loss MI (SMI) has been introduced [20]:

SMI := (1/2) ∫ Σ_{y=1}^c p(x) p(y) ( p(x, y) / ( p(x) p(y) ) − 1 )² dx.   (2)
Ordinary MI is the Kullback-Leibler (KL) divergence [10] from p(x, y) to p(x)p(y), while SMI is the Pearson (PE) divergence [14]. Both the KL and PE divergences belong to the class of Ali-Silvey-Csiszár divergences [2, 4], which is also known as the class of f-divergences. Thus, MI and SMI share many common properties; for example, they are non-negative and equal to zero if and only if the feature vector x and the label y are statistically independent. Information-maximization clustering based on SMI was shown to be computationally advantageous [19]. Below, we review the SMI-based clustering (SMIC) algorithm.
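These shared properties of MI and SMI can be checked numerically on a small discrete joint distribution. The sketch below (our own illustration; `mi_and_smi` is a hypothetical helper, not from the paper) computes the KL-divergence form of MI and the Pearson-divergence form of SMI for a joint probability table, so that both vanish under independence and are positive under dependence:

```python
import numpy as np

def mi_and_smi(pxy):
    """Compute MI (KL divergence) and SMI (Pearson divergence) between a
    discrete joint distribution pxy[i, j] = p(x=i, y=j) and the product of
    its marginals p(x)p(y)."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    prod = px * py                        # independence baseline p(x)p(y)
    ratio = pxy / prod                    # density ratio p(x,y)/(p(x)p(y))
    mi = np.sum(pxy * np.log(ratio))                 # Eq.(1) analogue
    smi = 0.5 * np.sum(prod * (ratio - 1.0) ** 2)    # Eq.(2) analogue
    return mi, smi

# Independent joint table: both measures are zero.
indep = np.outer([0.3, 0.7], [0.5, 0.5])
# Dependent joint table: both measures are strictly positive.
dep = np.array([[0.4, 0.1],
                [0.1, 0.4]])
```

For `indep`, both values are numerically zero; for `dep`, both are strictly positive, illustrating the independence characterization shared by the two divergences.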
2.2 SMI-Based Clustering
In unsupervised clustering, it is not straightforward to approximate SMI (2) because labeled samples are not available. To cope with this problem, let us expand the squared term in Eq.(2). Then SMI can be expressed as

SMI = (1/2) ∫ Σ_{y=1}^c p(x, y) ( p(x, y) / ( p(x) p(y) ) ) dx − 1/2.   (3)

Suppose that the class-prior probability p(y) is uniform, i.e., p(y) = 1/c for y = 1, …, c. Then we can express Eq.(3) as

SMI = (c/2) ∫ Σ_{y=1}^c p(y|x)² p(x) dx − 1/2.   (4)
Let us approximate the class-posterior probability by the following kernel model:

p(y|x; α^(y)) := Σ_{i=1}^n α_i^(y) K(x, x_i),   (5)

where α^(y) = (α_1^(y), …, α_n^(y))^T is the parameter vector, ^T denotes the transpose, and K(x, x′) denotes a kernel function. Let K be the kernel matrix whose (i, j)-th element is given by K(x_i, x_j). Approximating the expectation over p(x) in Eq.(4) with the empirical average over the samples x_1, …, x_n and replacing the class-posterior probability p(y|x) with the kernel model p(y|x; α^(y)), we have the following SMI approximator:

ŜMI := ( c / (2n) ) Σ_{y=1}^c α^(y)T K² α^(y) − 1/2.   (6)
Under the orthonormality constraint on {α^(y)}_{y=1}^c, a global maximizer is given by the normalized eigenvectors φ_1, …, φ_c associated with the c largest eigenvalues λ_1 ≥ ⋯ ≥ λ_c of K. Because the sign of an eigenvector is arbitrary, we set the sign as
φ̃_y = φ_y × sign(φ_y^T 1_n),
where sign(·) denotes the sign of a scalar and 1_n denotes the n-dimensional vector with all ones. On the other hand, since Σ_{y=1}^c p(y|x) = 1 and the class-prior probability p(y) was set to be uniform, we have the following normalization condition:
(1/n) Σ_{i=1}^n p(y|x_i; α^(y)) = 1/c.
Furthermore, negative outputs are rounded up to zero to ensure that outputs are non-negative.
Taking these post-processing steps into account, the cluster assignment y_i for x_i is determined as the maximizer of the approximation of p(y|x_i):
y_i = argmax_y [ max(0_n, φ̃_y) ]_i / ( 1_n^T max(0_n, φ̃_y) ),
where 0_n denotes the n-dimensional vector with all zeros, the max operation for vectors is applied element-wise, and [·]_i denotes the i-th element of a vector. Note that K φ̃_y = λ_y φ̃_y is used in the above derivation.
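The eigendecomposition-plus-post-processing pipeline above can be condensed into a short sketch. This is our own minimal illustration, not the paper's reference implementation: it assumes a plain Gaussian kernel of width `sigma` (the experiments in Section 4 use a sparse local-scaling kernel instead), and it normalizes the clamped eigenvectors column-wise before taking the element-wise argmax:

```python
import numpy as np

def smic(X, c, sigma=1.0):
    """Minimal SMIC sketch: leading-eigenvector clustering of a Gaussian
    kernel matrix with sign fixing, clamping, and normalization."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / (2.0 * sigma ** 2))        # kernel matrix
    vals, vecs = np.linalg.eigh(K)              # eigenvalues in ascending order
    phi = vecs[:, -c:][:, ::-1]                 # c leading eigenvectors
    # Fix the arbitrary eigenvector signs so each column sums to a nonnegative value.
    phi *= np.where(phi.sum(axis=0) >= 0, 1.0, -1.0)
    # Round negative outputs up to zero, then normalize each column.
    phi = np.maximum(phi, 0.0)
    phi /= phi.sum(axis=0, keepdims=True) + 1e-12
    # Cluster assignment: maximizer of the normalized responses per sample.
    return np.argmax(phi, axis=1)
```

On two well-separated point clouds, the two leading eigenvectors are supported on one cloud each, so the argmax recovers the cloud structure.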
For out-of-sample prediction, the cluster assignment y for a new sample x may be obtained by evaluating the learned model at x:

y = argmax_y max( 0, Σ_{i=1}^n [φ̃_y]_i K(x, x_i) ) / ( λ_y 1_n^T max(0_n, φ̃_y) ).   (7)

This clustering algorithm is called SMI-based clustering (SMIC).
SMIC may include a tuning parameter, say t, in the kernel function, and the clustering results of SMIC depend on the choice of t. A notable advantage of information-maximization clustering is that such a tuning parameter can be systematically optimized by the same information-maximization principle. More specifically, cluster assignments {y_i}_{i=1}^n are first obtained for each possible t. Then the quality of clustering is measured by the SMI value estimated from the paired samples {(x_i, y_i)}_{i=1}^n. For this purpose, the method of least-squares mutual information (LSMI) [20] is useful, because LSMI was theoretically proved to be the optimal non-parametric SMI approximator [21]; see Appendix 0.A for the details of LSMI. Thus, we compute LSMI as a function of t, and the tuning parameter value that maximizes LSMI is selected as the most suitable one.

3 Semi-Supervised SMIC
In this section, we extend SMIC to the semi-supervised clustering scenario where a set of must-links and a set of cannot-links are provided. A must-link between x_i and x_j means that they are encouraged to belong to the same cluster, while a cannot-link between x_i and x_j means that they are encouraged to belong to different clusters. Let M be the must-link matrix with M_ij = 1 if a must-link between x_i and x_j is given and M_ij = 0 otherwise. In the same way, we define the cannot-link matrix C. We assume that M_ii = 1 for all i and C_ii = 0 for all i. Below, we explain how must-link constraints and cannot-link constraints are incorporated into the SMIC formulation.
3.1 Incorporating Must-Links in SMIC
When there exists a must-link between x_i and x_j, we want them to share the same class label. Let p_i = ( p(y = 1|x_i), …, p(y = c|x_i) )^T be the soft-response vector for x_i. Then the inner product p_i^T p_j is maximized if and only if x_i and x_j belong to the same cluster with perfect confidence, i.e., p_i and p_j are the same vector, with 1 in one common element and 0 elsewhere. Thus, the must-link information may be utilized by increasing p_i^T p_j whenever M_ij = 1. We implement this idea by adding to the SMIC objective a reward term proportional to Σ_{i,j} M_ij p_i^T p_j, with a non-negative weight that determines how strongly we encourage the must-links to be satisfied.
Let us further utilize the following fact: if x_i and x_j belong to the same class and x_j and x_k belong to the same class, then x_i and x_k also belong to the same class (i.e., a friend's friend is a friend). Such induced must-links, collected in the matrix M², can be incorporated in SMIC through an additional reward term of the same form. If we set the weights of the direct and induced terms to a common value, we obtain a simpler combined expression, which will be used later.
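The friend's-friend rule can be illustrated by materializing the induced links explicitly. The sketch below is our own illustration (the paper folds the induced links into the objective via M² rather than computing a closure); repeatedly squaring the link matrix propagates must-links until no new pair is added:

```python
import numpy as np

def propagate_must_links(M):
    """Transitive closure of a must-link matrix: if (i, j) and (j, k) are
    must-linked, mark (i, k) as linked too ("a friend's friend is a
    friend").  Self-links are included, mirroring the assumption M_ii = 1."""
    n = len(M)
    A = (M + np.eye(n)) > 0                  # boolean adjacency with self-links
    prev = np.zeros_like(A)
    while not np.array_equal(A, prev):       # square until the closure stabilizes
        prev = A
        A = (A.astype(float) @ A.astype(float)) > 0
    return A.astype(float)
```

For example, links (0, 1) and (1, 2) induce the link (0, 2), while an unrelated point stays unlinked.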
3.2 Incorporating Cannot-Links in SMIC
We may incorporate cannot-links in SMIC in the opposite way to must-links, by decreasing the inner product p_i^T p_j toward zero. This may be implemented by adding to the SMIC objective a penalty term proportional to Σ_{i,j} C_ij p_i^T p_j, with a non-negative weight that determines how strongly we encourage the cannot-links to be satisfied.
In binary clustering problems where c = 2, if x_i and x_j belong to different classes and x_j and x_k belong to different classes, then x_i and x_k actually belong to the same class (i.e., an enemy's enemy is a friend). Such induced pairs, collected in the matrix C², can therefore be taken into account as additional must-links in the same way as above. If we set the corresponding weights to a common value, we again obtain a simpler combined expression, which will be used later.
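The enemy's-enemy rule for the binary case can likewise be made explicit. This is our own minimal illustration (`implied_must_links` is a hypothetical helper): squaring the cannot-link matrix finds pairs that share a cannot-linked partner and therefore must share a class when c = 2:

```python
import numpy as np

def implied_must_links(C):
    """In binary clustering, two points that are both cannot-linked to a
    common third point must share a class ("an enemy's enemy is a friend").
    Returns the matrix of such implied must-links."""
    D = ((C @ C) > 0).astype(float)   # D[i,k] > 0 iff some j has C[i,j] = C[j,k] = 1
    np.fill_diagonal(D, 0.0)          # drop the trivial i == k pairs
    return D
```

For example, cannot-links (0, 1) and (1, 2) imply that points 0 and 2 must share a class.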
3.3 Kernel Matrix Modification
Another approach to incorporating must-links and cannot-links is to modify the kernel matrix K. More specifically, K_ij is increased if there exists a must-link between x_i and x_j, and K_ij is decreased if there exists a cannot-link between x_i and x_j. In this paper, we assume 0 ≤ K_ij ≤ 1, and set K_ij = 1 if there exists a must-link between x_i and x_j, and K_ij = 0 if there exists a cannot-link between x_i and x_j. Let us denote the modified kernel matrix by K̃.
This modification idea has been employed in spectral learning [9] and demonstrated to be promising.
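The kernel modification described above is a one-liner per constraint type. A minimal sketch (our own illustration, assuming kernel values already lie in [0, 1] as stated in the text):

```python
import numpy as np

def modify_kernel(K, M, C):
    """Spectral-learning-style kernel modification: raise must-linked
    entries to the maximum similarity 1 and lower cannot-linked entries to
    the minimum similarity 0, assuming 0 <= K_ij <= 1."""
    K2 = K.copy()
    K2[M > 0] = 1.0   # must-links: maximal similarity
    K2[C > 0] = 0.0   # cannot-links: minimal similarity
    return K2
```

Entries not touched by any link keep their original similarity values.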
3.4 Semi-Supervised SMIC
Finally, we combine the above three ideas: the SMIC objective is computed from the modified kernel matrix K̃ and augmented with the must-link reward terms of Section 3.1 and the cannot-link penalty terms of Section 3.2, which yields the combined criterion (8). When c > 2, the enemy's-enemy must-link weight is fixed at zero.
This is the learning criterion of semi-supervised SMIC (3SMIC), whose global maximizer can be analytically obtained, under the orthonormality constraint, by the c leading eigenvectors of the matrix appearing in Eq.(8). Then the same post-processing as in the original SMIC is applied and cluster assignments are obtained. Out-of-sample prediction is also possible in the same way as in the original SMIC.
3.5 Tuning Parameter Optimization in 3SMIC
In the original SMIC, an SMI approximator called LSMI is used for tuning parameter optimization (see Appendix 0.A). However, LSMI alone is not suitable in semi-supervised scenarios because the 3SMIC solution is biased toward satisfying the must-links and cannot-links. Here, we propose using the criterion
LSMI(θ) − Penalty(θ),
where θ indicates the tuning parameters in 3SMIC; in the experiments, the must-link weight, the cannot-link weight, and the parameter included in the kernel function are optimized. "Penalty" is the penalty for violating must-links and cannot-links, which is the only tuning factor in the proposed algorithm.
4 Experiments
In this section, we experimentally evaluate the performance of the proposed 3SMIC method in comparison with popular semi-supervised clustering methods: spectral learning (SL) [9] and constrained spectral clustering (CSC) [24]. Both methods first perform semi-supervised spectral embedding and then k-means to obtain clustering results. However, we observed that the post k-means step is often unreliable, so for CSC we instead use simple thresholding [17] in the case of binary clustering.
In all experiments, we use a sparse version of the local-scaling kernel [25] as the similarity measure:
K_ij = exp( −||x_i − x_j||² / (2 s_i s_j) ) if x_i ∈ N_k(x_j) or x_j ∈ N_k(x_i), and K_ij = 0 otherwise,
where N_k(x) denotes the set of k nearest neighbors for x (k is the kernel parameter), s_i is a local scaling factor defined as s_i = ||x_i − x_i^(k)||, and x_i^(k) is the k-th nearest neighbor of x_i. For SL and CSC, we test a range of values of k (note that there is no systematic way to choose the value of k), except for a setting on the spam dataset that caused numerical problems in the eigensolver when testing SL. On the other hand, in 3SMIC, we choose the value of k based on the following criterion:

LSMI − Penalty,   (9)

where Penalty is the number of violated links. Here, both the LSMI value and the penalty are normalized so that they fall into the range [0, 1]. The must-link and cannot-link weight parameters in 3SMIC are also chosen based on Eq.(9).
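The two ingredients of this experimental protocol, the sparse local-scaling similarity and the normalized selection criterion (9), can be sketched as follows. This is our own illustration: the neighbor-masking convention (keeping entries where either point is among the other's k nearest neighbors) and the [0, 1] rescaling by min-max normalization are assumptions about details the paper does not spell out, and `select_parameter` takes precomputed LSMI values and violation counts rather than computing them:

```python
import numpy as np

def local_scaling_kernel(X, k=7):
    """Sparse local-scaling similarity [25]: K_ij = exp(-||x_i - x_j||^2 /
    (2 s_i s_j)) kept only if x_i and x_j are among each other's k nearest
    neighbors (either direction), where s_i is the distance from x_i to its
    k-th nearest neighbor."""
    n = len(X)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    order = np.argsort(sq, axis=1)            # order[i, 0] is i itself
    s = np.sqrt(sq[np.arange(n), order[:, k]])  # local scaling factors
    K = np.exp(-sq / (2.0 * np.outer(s, s)))
    mask = np.zeros_like(K, dtype=bool)
    mask[np.arange(n)[:, None], order[:, 1:k + 1]] = True  # k nearest neighbors
    return K * (mask | mask.T)                # sparsify; zero elsewhere

def select_parameter(candidates, lsmi_vals, violations):
    """Criterion (9): pick the candidate maximizing min-max-normalized LSMI
    minus min-max-normalized link-violation count."""
    l = np.asarray(lsmi_vals, dtype=float)
    v = np.asarray(violations, dtype=float)
    l = (l - l.min()) / (l.max() - l.min() + 1e-12)
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)
    return candidates[int(np.argmax(l - v))]
```

The returned kernel matrix is symmetric with zeros for non-neighboring pairs, and the selection rule trades estimated SMI against constraint violations on a common [0, 1] scale.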
We use the following real-world datasets:
- parkinson (, , and ): The UCI dataset consisting of voice recordings from patients suffering from Parkinson's disease and from healthy individuals. From each voice recording, 22 features are extracted.
- spam (, , and ): The UCI dataset consisting of e-mails categorized into spam and non-spam. 48 word-frequency features and 9 other frequency features, such as specific characters and capitalization, are extracted.
- sonar (, , and ): The UCI dataset consisting of sonar responses from a metal object or a rock. The features represent the energy in each frequency band.
- digits500 (, , and ): The USPS digits dataset consisting of grayscale images (256 pixels) of handwritten digits from 0 to 9. We randomly sampled 50 images for each digit and normalized the pixel intensities of each image.
- digits5k (, , and ): The same USPS digits dataset, but with 500 images for each class.
- faces100 (, , and ): The Olivetti Face dataset consisting of grayscale images of human faces (4096 pixels). We randomly selected 10 persons and used 10 images for each person.
Must-links and cannot-links are generated from the true labels by randomly sampling a pair of points and adding the corresponding 1 to the M or C matrix, depending on the labels of the chosen pair. CSC is excluded from digits5k and spam because it needs to solve the complete eigenvalue problem, and its computational cost was too high on these large datasets.
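The link-generation protocol described above can be sketched directly. This is our own illustration of the procedure (`sample_links` is a hypothetical helper; the number of links and the random generator are parameters):

```python
import numpy as np

def sample_links(y, n_links, rng):
    """Generate must-link and cannot-link matrices from true labels y by
    repeatedly sampling a random pair and recording a 1 in M or C according
    to whether the pair shares a label."""
    n = len(y)
    M = np.zeros((n, n))
    C = np.zeros((n, n))
    for _ in range(n_links):
        i, j = rng.choice(n, size=2, replace=False)  # a random distinct pair
        if y[i] == y[j]:
            M[i, j] = M[j, i] = 1.0                  # same label: must-link
        else:
            C[i, j] = C[j, i] = 1.0                  # different label: cannot-link
    return M, C
```

By construction, every recorded must-link joins same-labeled points and every cannot-link joins differently-labeled points.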
We evaluate the clustering performance by the Adjusted Rand Index (ARI) [8] between learned and true labels. Larger ARI values mean better clustering performance, and a zero ARI value means that the clustering result is equivalent to random assignment. We investigate the ARI score as a function of the number of links used. Averages and standard deviations of ARI over repeated runs with different random seeds are plotted in Figure 1.
We can separate the datasets into two groups. For digits500, digits5k, and faces100, the baseline performances without links are reasonable, and the introduction of links significantly increases the performance.
For parkinson, spam, and sonar, where the baseline performances without links are poor, the introduction of links quickly allows the clustering algorithms to find better solutions. In particular, only 3% of the links (relative to all possible pairs) was sufficient for parkinson to achieve reasonable performance, and surprisingly, only 0.1% for spam.
As shown in Figure 1, the performance of SL depends heavily on the choice of the kernel parameter, but there is no systematic way to choose it for SL. It is important to note that 3SMIC, with the kernel parameter chosen systematically based on Eq.(9), performs as well as SL tuned optimally with hindsight. On the other hand, CSC performs rather stably for different kernel parameter values, and it works particularly well for binary problems with a small number of links. However, it performs very poorly on multi-class problems; we observed that the post k-means step is highly unreliable and that poor locally optimal solutions are often produced. For the binary problems, simply performing thresholding [17] instead of using k-means was found to be useful; however, there seems to be no simple alternative in the multi-class case. The performance of CSC drops on parkinson and sonar when the number of links is increased, although such phenomena were not observed for SL and 3SMIC.
Overall, the proposed 3SMIC method was shown to be a promising semi-supervised clustering method.
5 Conclusions
In this paper, we proposed a novel information-maximization clustering method that can utilize side information provided in the form of must-links and cannot-links. The proposed method, named semi-supervised SMI-based clustering (3SMIC), allows us to compute the clustering solution analytically. This is a strong advantage over conventional approaches such as constrained spectral clustering (CSC), which requires a post k-means step that can be unreliable and cause significant performance degradation in practice. Furthermore, 3SMIC allows us to systematically determine tuning parameters such as the kernel width based on the information-maximization principle, while respecting the provided side information. Through experiments, we demonstrated that automatically tuned 3SMIC performs as well as optimally tuned SL with hindsight.
The focus of this paper was to inherit the analytical treatment of the original unsupervised SMIC in semi-supervised learning scenarios. Although this analytical treatment was demonstrated to be highly useful in experiments, our future work will explore more efficient use of must-links and cannot-links.
In previous work [11], negative eigenvalues were found to contain useful information. Because must-link and cannot-link matrices can possess negative eigenvalues, it would be interesting to investigate the role and effect of negative eigenvalues in the context of information-maximization clustering.
Acknowledgements
This work was carried out while DC was visiting the Tokyo Institute of Technology under the YSEP program. GN was supported by the MEXT scholarship, and MS was supported by MEXT KAKENHI 25700022 and AOARD.
Appendix 0.A Least-Squares Mutual Information
The solution of SMIC depends on the choice of the kernel parameter included in the kernel function K(x, x′). Since SMIC was developed in the framework of SMI maximization, it is natural to determine the kernel parameter so as to maximize SMI. A direct approach is to use the SMI estimator (6) also for kernel parameter choice. However, this direct approach is not favorable because (6) is an unsupervised SMI estimator, i.e., SMI is estimated only from the unlabeled samples {x_i}_{i=1}^n. On the other hand, at the model selection stage, we have already obtained the labeled samples {(x_i, y_i)}_{i=1}^n, and thus supervised estimation of SMI is possible. For supervised SMI estimation, a non-parametric SMI estimator called least-squares mutual information (LSMI) [20] was proved to achieve the optimal convergence rate to the true SMI. Here we briefly review LSMI.
The key idea of LSMI is to learn the following density-ratio function [18],

r(x, y) := p(x, y) / ( p(x) p(y) ),

without going through probability density/mass estimation of p(x, y), p(x), and p(y). More specifically, let us employ the following density-ratio model:

r(x, y; θ) := Σ_{l : y_l = y} θ_l L(x, x_l),   (10)

where θ is the parameter vector and L(x, x′) is a kernel function. In practice, we use the Gaussian kernel
L(x, x′) = exp( −||x − x′||² / (2κ²) ),
where the Gaussian width κ is the kernel parameter. To reduce the computation cost, we limit the number of kernel bases by using randomly selected kernel centers.
The parameter θ in the above density-ratio model is learned so that the following squared error is minimized:

J(θ) := (1/2) ∫ Σ_{y=1}^c ( r(x, y; θ) − r(x, y) )² p(x) p(y) dx.   (11)

Let θ^(y) be the parameter vector corresponding to the kernel bases {x_l : y_l = y}, i.e., θ^(y) is the subvector of θ consisting of the indices {l : y_l = y}. Let n_y be the number of samples in class y, which is the same as the dimensionality of θ^(y). Then an empirical and regularized version of the optimization problem (11) is given for each y as follows:

min_{θ^(y)} [ (1/2) θ^(y)T H^(y) θ^(y) − θ^(y)T h^(y) + (δ/2) θ^(y)T θ^(y) ],   (12)

where δ (≥ 0) is the regularization parameter. H^(y) is the n_y × n_y matrix and h^(y) is the n_y-dimensional vector defined as
H^(y)_{l,l′} := ( n_y / n² ) Σ_{i=1}^n L(x_i, x_l^(y)) L(x_i, x_{l′}^(y)),
h^(y)_l := (1/n) Σ_{i : y_i = y} L(x_i, x_l^(y)),
where x_l^(y) is the l-th sample in class y (which corresponds to θ_l^(y)).
A notable advantage of LSMI is that the solution θ̂^(y) can be computed analytically as
θ̂^(y) = ( H^(y) + δ I )^{−1} h^(y).
Then a density-ratio estimator is obtained analytically as follows:
r̂(x, y) = Σ_{l : y_l = y} θ̂_l L(x, x_l).
The accuracy of the above least-squares density-ratio estimator depends on the choice of the kernel parameter κ included in L(x, x′) and the regularization parameter δ in Eq.(12). These tuning parameter values can be systematically optimized by cross-validation as follows. First, the samples Z = {(x_i, y_i)}_{i=1}^n are divided into T disjoint subsets Z_1, …, Z_T of approximately the same size. Then a density-ratio estimator r̂_t is obtained using Z \ Z_t (i.e., all samples except Z_t), and its hold-out error (which corresponds to Eq.(11) without the irrelevant constant) for the hold-out samples Z_t is computed as

CV_t := ( 1 / (2|Z_t|²) ) Σ_{x, y ∈ Z_t} r̂_t(x, y)² − ( 1 / |Z_t| ) Σ_{(x, y) ∈ Z_t} r̂_t(x, y),

where Σ_{x, y ∈ Z_t} denotes the summation over all combinations of x and y in Z_t (and thus |Z_t|² terms), while Σ_{(x, y) ∈ Z_t} denotes the summation over all pairs (x_i, y_i) in Z_t (and thus |Z_t| terms). This procedure is repeated for t = 1, …, T, and the average of the above hold-out error over all t is computed as
CV := (1/T) Σ_{t=1}^T CV_t.
Then the kernel parameter and the regularization parameter that minimize the average hold-out error CV are chosen as the most suitable ones.
Finally, given that SMI (2) can be expressed as

SMI = (1/2) ∫ Σ_{y=1}^c r(x, y) p(x, y) dx − 1/2,

an SMI estimator based on the above density-ratio estimator, called least-squares mutual information (LSMI), is given as follows:

LSMI := ( 1 / (2n) ) Σ_{i=1}^n r̂(x_i, y_i) − 1/2,

where r̂(x, y) is the density-ratio estimator obtained above.
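The LSMI pipeline, fitting a ridge-regularized least-squares density-ratio model per class and plugging it into the SMI expression, can be condensed into a rough sketch. This is our own simplified illustration, not the exact per-class system (12): it uses one shared Gaussian basis for all classes, fixed default values for the Gaussian width `sigma` and regularization `delta` (the appendix chooses these by cross-validation), and omits the hold-out procedure entirely:

```python
import numpy as np

def lsmi(X, y, sigma=1.0, delta=0.1, n_basis=100, seed=0):
    """Rough LSMI sketch: per class, solve the regularized least-squares
    problem for the density ratio r(x, y) ~ p(x, y)/(p(x)p(y)) over a
    shared Gaussian basis, then return (1/2) mean r_hat - 1/2."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = X[rng.choice(n, size=min(n_basis, n), replace=False)]
    sq = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    Phi = np.exp(-sq / (2.0 * sigma ** 2))        # basis design matrix
    r_hat = np.zeros(n)
    for cls in np.unique(y):
        idx = (y == cls)
        py = idx.mean()                            # empirical class prior
        H = py * (Phi.T @ Phi) / n                 # approximates E_{p(x)p(y)}[phi phi^T]
        h = Phi[idx].sum(axis=0) / n               # approximates E_{p(x,y)}[phi]
        theta = np.linalg.solve(H + delta * np.eye(Phi.shape[1]), h)
        r_hat[idx] = Phi[idx] @ theta              # ratio estimates for class cls
    return 0.5 * r_hat.mean() - 0.5                # plug-in SMI estimate
```

On data where labels track well-separated clusters, the estimate is clearly positive, while shuffling the labels (making x and y nearly independent) drives it toward zero, which is exactly the behavior the model selection step of Section 2.2 relies on.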
References
 [1] (2006) Kernelized infomax clustering. In Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt (Eds.), Cambridge, MA, USA, pp. 17–24. Cited by: §1, §2.1.
 [2] (1966) A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B 28 (1), pp. 131–142. Cited by: §2.1.
 [3] (1998) Robust and efficient estimation by minimising a density power divergence. Biometrika 85 (3), pp. 549–559. Cited by: §2.1.
 [4] (1967) Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 2, pp. 229–318. Cited by: §2.1.

 [5] (2002) Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks 13 (3), pp. 780–784. Cited by: §1.
 [6] (2007) Dissimilarity in graph-based semi-supervised classification. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS2007), pp. 155–162. Cited by: §1.
 [7] (2010) Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, R. Zemel, J. Shawe-Taylor, and A. Culotta (Eds.), pp. 766–774. Cited by: §1, §2.1.
 [8] (1985) Comparing partitions. Journal of Classification 2 (1), pp. 193–218. Cited by: §4.
 [9] (2003) Spectral learning. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI2003), pp. 561–566. Cited by: §1, §3.3, §4.
 [10] (1951) On information and sufficiency. The Annals of Mathematical Statistics 22, pp. 79–86. Cited by: §2.1.
 [11] (2004) Feature discovery in non-metric pairwise data. Journal of Machine Learning Research 5, pp. 801–818. Cited by: §5.
 [12] (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Berkeley, CA, USA, pp. 281–297. Cited by: §1.

 [13] (2002) On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Cambridge, MA, USA, pp. 849–856. Cited by: §1.
 [14] (1900) On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5 50 (302), pp. 157–175. Cited by: §2.1.
 [15] (2002) Learning with kernels. MIT Press, Cambridge, MA, USA. Cited by: §1.
 [16] (1948) A mathematical theory of communication. Bell Systems Technical Journal 27, pp. 379–423. Cited by: §2.1.
 [17] (2000) Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8), pp. 888–905. Cited by: §1, §4, §4.
 [18] (2012) Density ratio estimation in machine learning. Cambridge University Press, Cambridge, UK. Cited by: Appendix 0.A.
 [19] (2011) On information-maximization clustering: tuning parameter selection and analytic solution. In Proceedings of the 28th International Conference on Machine Learning (ICML2011), L. Getoor and T. Scheffer (Eds.), Bellevue, Washington, USA, pp. 65–72. Cited by: §1, §2.1.
 [20] (2009) Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics 10 (1), pp. S52 (12 pages). Cited by: Appendix 0.A, §2.1, §2.2.
 [21] (2013) Sufficient dimension reduction via squaredloss mutual information estimation. Neural Computation 3 (25), pp. 725–758. Cited by: §2.2.
 [22] (2001) Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML2001), pp. 577–584. Cited by: §1.
 [23] (2000) Clustering with instancelevel constraints. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML2000), pp. 1103–1110. Cited by: §1.
 [24] (2010) Flexible constrained spectral clustering. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2010), pp. 563–572. Cited by: §1, §4.
 [25] (2005) Selftuning spectral clustering. In Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou (Eds.), Cambridge, MA, USA, pp. 1601–1608. Cited by: §4.