1 Introduction
Clustering can be thought of as the task of automatically dividing a set of objects into "coherent" subsets. This definition is not concrete, but its vagueness allows it to serve as an umbrella term for a wide diversity of algorithmic paradigms. Clustering algorithms are routinely applied in a huge variety of fields.
Given a dataset that needs to be clustered for some application, one can choose among a variety of different clustering algorithms, along with different preprocessing techniques, that are likely to result in dramatically different answers. It is therefore critical to incorporate prior knowledge about the data and the intended semantics of the clustering into the process of picking a clustering algorithm (or, clustering model selection). Regretfully, there seems to be no systematic tool for incorporating domain expertise into clustering model selection, and such decisions are usually made in embarrassingly ad hoc ways. This paper aims to address that critical deficiency in a formal statistical framework.
We approach the challenge by considering a scenario in which the domain expert (i.e., the intended user of the clustering) conveys her domain knowledge by providing a clustering of a small random subset of her data set. For example, consider a big customer service center that wishes to cluster incoming requests into groups to streamline their handling. Since the database of requests is too large to be organized manually, the service center wishes to employ a clustering program. As the clustering designer, we would then ask the service center to pick a random sample of requests, manually cluster them, and show us the resulting grouping of that sample. The clustering tool then uses that sample clustering to pick a clustering method that, when applied to the full data set, will result in a clustering that follows the patterns demonstrated by that sample clustering. We address this paradigm from a statistical machine learning perspective. Aiming to achieve generalization guarantees for such an approach, it is essential to introduce some
inductive bias. We do that by restricting the clustering algorithm to a predetermined hypothesis class (or a set of concrete clustering algorithms). In a recent Dagstuhl workshop, Blum (2014) proposed to do that by fixing a clustering algorithm, say $k$-means, and searching for a metric over the data under which $k$-means optimization yields a clustering that agrees with the training sample clustering. One should note that, given any domain set $X$ and any partitioning $C$ of $X$, there exists some distance function over $X$ under which $C$ is the optimal $k$-means clustering solution (this property is sometimes called richness). Consequently, to protect against potential overfitting, the class of potential distance functions should be constrained. In this paper, we provide (apparently the first) concrete formal framework for such a paradigm, as well as a generalization analysis of this approach.
In this work we focus on center-based clustering, an important class of clustering algorithms. In these algorithms, the goal is to find a set of "centers" (or prototypes), and the clusters are the Voronoi cells induced by this set of centers. The objective of such a clustering is to minimize the expected value of some monotonically increasing function of the distances of points to their cluster centers. The $k$-means clustering objective is arguably the most popular clustering paradigm in this class. Currently, center-based clustering tools lack a vehicle for incorporating domain expertise. Domain knowledge is usually taken into account only through an ad hoc choice of input data representation. Regretfully, it might not be realistic to require the domain expert to translate sufficiently elaborate task-relevant knowledge into handcrafted features.
As a model for learning representations, we assume that the user-desirable clustering can be approximated by first mapping the sample to some Euclidean (or Hilbert) space and then performing $k$-means clustering in the mapped space (or, equivalently, replacing the input data metric by some kernel and performing center-based clustering with respect to that kernel). The clustering algorithm, however, is supposed to learn a suitable mapping based on the given sample clustering.
The main question addressed in this work is that of the sample complexity: what size of sample, to be clustered by the domain expert, suffices for finding a close-to-optimal mapping (i.e., a mapping that generalizes well on the test data)? Intuitively, this sample complexity depends on the richness of the class of potential mappings that the algorithm chooses from. In standard supervised learning, there are well-established notions of capacity of hypothesis classes (e.g., the VC-dimension) that characterize the sample complexity of learning. This paper aims to provide such relevant notions of capacity for clustering.
1.1 Previous Work
In practice, there are methods that use some forms of supervision for clustering. These methods are sometimes called "semi-supervised clustering" (Basu et al. (2002, 2004); Kulis et al. (2009)). The most common method to convey such supervision is through a set of pairwise must/cannot-link constraints on the instances (Wagstaff et al. (2001)). A common way of using such information is by changing the objective of clustering so that violations of these constraints are penalized (Demiriz et al. (1999); Law et al. (2005); Basu et al. (2008)). Another approach, which is closer to ours, keeps the clustering optimization objective fixed and instead searches for a metric that best fits the given constraints. The metric is learned based on some objective function over metrics (Xing et al. (2002); Alipanahi et al. (2008)), so that pairs of instances marked must-link will be close in the new metric space (and cannot-link pairs will be far apart). The two above approaches can also be integrated (Bilenko et al. (2004)). However, these objective functions are usually rather ad hoc. In particular, it is not clear in what sense they are compatible with the adopted clustering algorithm (such as $k$-means clustering).
A different approach to the problem of communicating user expertise for the purpose of choosing a clustering tool is discussed in Ackerman et al. (2010). They considered a set of properties, or requirements, for clustering algorithms, and investigated which of those properties hold for various algorithms. The user can then pick the right algorithm based on the requirements that she wants the algorithm to meet. However, to turn such an approach into a practically useful tool, one will need to come up with properties that are relevant to the end user of clustering, a goal that is still far from being reached.
Statistical convergence rates of sample clustering to the optimal clustering, with respect to some data-generating probability distribution, play a central role in our analysis. From that perspective, most relevant to our paper are results that provide generalization bounds for $k$-means clustering.
Ben-David (2007) proposed the first dimension-independent generalization bound for $k$-means clustering, based on compression techniques. Biau et al. (2008) tightened this result by an analysis of Rademacher complexity. Maurer and Pontil (2010) investigated a more general framework in which generalization bounds for $k$-means as well as other algorithms can be obtained. It should be noted that these results concern the standard clustering setup (without any supervised feedback), where the data representation is fixed and known to the clustering algorithm.
1.2 Contributions
Our first contribution is to provide a statistical framework for analyzing the problem of learning a representation for clustering. We assume that the expert has some implicit target clustering of the dataset in her mind. The learner, however, is unaware of it, and instead has to select a mapping, among a set of potential mappings, under which the result of $k$-means clustering will be similar to the target partition. An appropriate notion of loss function is introduced to quantify the success of the learner. Then, we define the analogous notion of PAC-learnability (PAC stands for the well-known notion of "probably approximately correct", popularized by Valiant (1984)) for the problem of learning a representation for clustering.
The second contribution of the paper is the introduction of a combinatorial parameter, a specific notion of the capacity of the class of mappings, that determines the sample complexity of the clustering learning task. This combinatorial notion is a multivariate version of the pseudo-dimension of a class of real-valued mappings. We show that there is uniform convergence of empirical losses to the true loss, over any class of embeddings $\mathcal{F}$, at a rate that is determined by the proposed dimension of $\mathcal{F}$. This implies that any empirical risk minimization (ERM) algorithm will successfully learn such a class from sample sizes upper bounded by those rates. Finally, we analyze a particular natural class, the class of linear mappings from $\mathbb{R}^d$ to $\mathbb{R}^{d'}$, and show that, roughly speaking, a sample whose size grows linearly with $d d'$ (up to factors of $k$, $\frac{1}{\epsilon}$, and logarithmic terms) is sufficient to guarantee a near-optimal representation.
The rest of this paper is organized as follows: Section 2 defines the problem setting. Then, in Section 3, we investigate ERM-type algorithms and show that "uniform convergence" is sufficient for them to work. Furthermore, this section presents the uniform convergence results and the proof of an upper bound on the sample complexity. Finally, we conclude in Section 4 and provide some directions for future work.
2 Problem Setting
2.1 Preliminaries
Let $X$ be a finite domain set. A $k$-clustering of $X$ is a partition of $X$ into $k$ subsets. If $C$ is a clustering, we denote the subsets of the partition by $C_1, \ldots, C_k$; therefore we have $X = \cup_{i=1}^{k} C_i$. Let $S_k$ denote the set of all permutations over $[k]$, where $[k]$ denotes $\{1, \ldots, k\}$. The clustering difference between two clusterings, $C^1$ and $C^2$, with respect to $X$ is defined by

$$\Delta_X(C^1, C^2) = \min_{\sigma \in S_k} \frac{1}{2|X|} \sum_{i=1}^{k} \left| C^1_i \,\triangle\, C^2_{\sigma(i)} \right| \qquad (1)$$

where $|\cdot|$ and $\triangle$ denote the cardinality and the symmetric difference of sets respectively. For a sample $S \subset X$ and a clustering $C$ of $X$, we define $C|_S$ to be the partition of $S$ induced by $C$, namely $C|_S = \{C_1 \cap S, \ldots, C_k \cap S\}$. Accordingly, the sample-based difference between two partitions is defined by

$$\Delta_S(C^1, C^2) = \min_{\sigma \in S_k} \frac{1}{2|S|} \sum_{i=1}^{k} \left| (C^1_i \cap S) \,\triangle\, (C^2_{\sigma(i)} \cap S) \right| \qquad (2)$$
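The sample-based difference above can be computed directly by minimizing over all $k!$ matchings of cluster labels. A minimal sketch (ours, not the paper's code; the $\frac{1}{2n}$ normalization is our assumption, under which each point whose two labels disagree under the best matching contributes exactly $1/n$):

```python
from itertools import permutations

def clustering_difference(C1, C2, n):
    """Min over label matchings sigma of sum_i |C1[i] XOR C2[sigma(i)]| / (2n).

    C1, C2: lists of k disjoint sets forming two k-partitions of the same
    n points. Brute force over all k! permutations; illustrative only.
    """
    k = len(C1)
    return min(
        sum(len(C1[i] ^ C2[sigma[i]]) for i in range(k)) / (2 * n)
        for sigma in permutations(range(k))
    )
```

For larger $k$ one would replace the brute-force minimum with an optimal assignment (Hungarian) algorithm, since the objective decomposes over label pairs.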
Let $f$ be a mapping from $X$ to $\mathbb{R}^{d'}$, and let $\mu = (\mu_1, \ldots, \mu_k)$ be a vector of $k$ centers in $\mathbb{R}^{d'}$. The clustering defined by $(f, \mu)$ is the partition of $X$ induced by the Voronoi partition in $\mathbb{R}^{d'}$: each $x \in X$ is assigned to the cluster of its nearest center, $\arg\min_{i \in [k]} \|f(x) - \mu_i\|$. The $k$-means cost of clustering $X$ with a set of centers $\mu$ and with respect to a mapping $f$ is defined by

$$\mathrm{COST}_f(X, \mu) = \frac{1}{|X|} \sum_{x \in X} \min_{i \in [k]} \| f(x) - \mu_i \|^2 \qquad (3)$$

The $k$-means clustering algorithm finds the set of centers that minimizes this cost (we assume that the solution to $k$-means clustering is unique; we will elaborate on this issue in the next sections). In other words,

$$\mu_f(X) = \arg\min_{\mu \in (\mathbb{R}^{d'})^k} \mathrm{COST}_f(X, \mu) \qquad (4)$$
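The cost and the induced Voronoi clustering can be sketched as follows (our own illustration; the array-of-images representation and the $\frac{1}{n}$ averaging are our conventions):

```python
import numpy as np

def kmeans_cost(F, mu):
    """Average squared distance of each mapped point to its nearest center.

    F:  (n, d') array holding the images f(x) of the n domain points.
    mu: (k, d') array of centers.
    """
    d2 = ((F[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    return float(d2.min(axis=1).mean())

def voronoi_labels(F, mu):
    """Index of the nearest center for each mapped point (the induced clustering)."""
    d2 = ((F[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```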
Also, for a partition $C$ of $X$ and a mapping $f$, we can define the cost of the clustering as follows.

$$\mathrm{COST}_f(X, C) = \frac{1}{|X|} \sum_{i=1}^{k} \sum_{x \in C_i} \| f(x) - \hat{\mu}_i \|^2, \quad \text{where } \hat{\mu}_i = \frac{1}{|C_i|} \sum_{x \in C_i} f(x) \qquad (5)$$

For a mapping $f$ as above, let $C_f(X)$ denote the $k$-means clustering of $X$ induced by $f$, namely

$$C_f(X) = \arg\min_{C} \mathrm{COST}_f(X, C) \qquad (6)$$

The difference between two mappings $f_1$ and $f_2$ with respect to $X$ is defined by the difference between the results of $k$-means clustering using these mappings. Formally,

$$\Delta_X(f_1, f_2) = \Delta_X(C_{f_1}(X), C_{f_2}(X)) \qquad (7)$$
The following proposition shows the "richness" property of the $k$-means objective.
Proposition 1.
Let $X$ be a domain set. For every $k$-clustering $C$ of $X$ and every $d' \geq 1$, there exists a mapping $f : X \to \mathbb{R}^{d'}$ such that $C_f(X) = C$.
Proof.
The mapping $f$ can be picked such that it collapses each cluster into a single point in $\mathbb{R}^{d'}$ (so that the image of $X$ under the mapping consists of just $k$ distinct points). The result of $k$-means clustering under such a mapping will be $C$. ∎
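The construction in the proof is easy to check numerically: collapse each target cluster onto one distinct point; the target partition then has $k$-means cost exactly 0 and is therefore optimal. A toy illustration (the data and anchor points are made up):

```python
import numpy as np

# Toy richness check: X has 6 points, the target clustering has k = 2 clusters.
target = [{0, 1, 2}, {3, 4, 5}]
anchors = np.array([[0.0], [10.0]])  # one distinct image per cluster

# f collapses every point of cluster i onto anchors[i].
f = {x: anchors[i] for i, cluster in enumerate(target) for x in cluster}

# Cost of the target partition with centers at the anchors: every point
# coincides with its center, so the cost is exactly 0 (hence optimal).
cost = sum(
    float(((f[x] - anchors[i]) ** 2).sum())
    for i, cluster in enumerate(target) for x in cluster
)
```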
In this paper, we investigate the transductive setup, in which there is a given data set, known to the learner, that needs to be clustered. Clustering often occurs as a task over some data-generating distribution (e.g., Von Luxburg and Ben-David (2005)). The current work can be readily extended to that setting. However, in that case, we assume that the clustering algorithm gets, on top of the clustered sample, a large unclustered sample drawn from that data-generating distribution.
2.2 Formal Problem Statement
Let $C^*$ be the target clustering of $X$. A (supervised) representation learner for clustering takes as input a sample $S \subset X$ and its clustering, $C^*|_S$, and outputs a mapping $\hat{f}$ from a set of potential mappings $\mathcal{F}$. In the following, PAC stands for the notion of "probably approximately correct".
Definition 1.
PAC Supervised Representation Learner for $k$-Means (PAC-SRLK)
Let $\mathcal{F}$ be a set of mappings from $X$ to $\mathbb{R}^{d'}$. A representation learning algorithm $\mathcal{A}$ is a PAC-SRLK with sample complexity $m_{\mathcal{F}}(\epsilon, \delta)$ with respect to $\mathcal{F}$, if for every $\epsilon, \delta \in (0, 1)$, every domain set $X$ and every clustering $C^*$ of $X$, the following holds:
if $S$ is a randomly (uniformly) selected subset of $X$ of size at least $m_{\mathcal{F}}(\epsilon, \delta)$, then with probability at least $1 - \delta$

$$\Delta_X(C_{\hat{f}}(X), C^*) \leq \min_{f \in \mathcal{F}} \Delta_X(C_f(X), C^*) + \epsilon \qquad (8)$$

where $\hat{f} = \mathcal{A}(S, C^*|_S)$ is the output of the algorithm.
This can be regarded as a formal PAC framework for analyzing the problem of learning a representation for $k$-means clustering. The learner is compared to the best mapping in the class $\mathcal{F}$.
A natural question is to provide bounds on the sample complexity of PAC-SRLK with respect to $\mathcal{F}$. Intuitively, for richer classes of mappings, we need larger clustered samples. Therefore, we need to introduce an appropriate notion of "capacity" of $\mathcal{F}$ and bound the sample complexity based on it. This is addressed in the next sections.
3 Analysis and Results
3.1 Empirical Risk Minimization
In order to prove an upper bound for the sample complexity of representation learning for clustering, we need to consider an algorithm and prove a sample complexity bound for it. Here, we show that any ERM-type algorithm can be used for this purpose. Therefore, we will be able to prove an upper bound for the sample complexity of PAC-SRLK.
Let $\mathcal{F}$ be a class of mappings and $X$ be the domain set. A TERM (Transductive Empirical Risk Minimizer) learner for $\mathcal{F}$ takes as input a sample $S \subset X$ and its clustering $C^*|_S$ and outputs:

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \Delta_S(C_f(X), C^*) \qquad (9)$$
Note that we call it transductive because it is implicitly assumed to have access to the unlabeled data set (i.e., $X$). A TERM algorithm goes over all mappings in $\mathcal{F}$ and selects the mapping that is most consistent with the given clustering: the mapping under which, if we perform $k$-means clustering of $X$, the sample-based difference between the result and $C^*|_S$ is minimized.
Note that we are not studying this algorithm as a computational tool; we only use it to show an upper bound for the sample complexity.
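For intuition only (consistent with the remark above that TERM is not meant as a computational tool), here is a brute-force TERM sketch for $k = 2$ on a tiny one-dimensional data set; the function names, tie-breaking, and the $\frac{1}{2|S|}$ normalization are all ours:

```python
from itertools import combinations
import numpy as np

def exact_2means_partition(vals):
    """Globally optimal 2-means bipartition of tiny 1-D data, by brute force
    (for a fixed partition the optimal centers are the cluster means)."""
    n, best, best_part = len(vals), float("inf"), None
    for r in range(1, n):
        for A in combinations(range(n), r):
            A = frozenset(A)
            B = frozenset(range(n)) - A
            cost = 0.0
            for grp in (A, B):
                g = np.array([vals[i] for i in grp])
                cost += float(((g - g.mean()) ** 2).sum())
            if cost < best - 1e-12:
                best, best_part = cost, (A, B)
    return best_part

def term_learner(X, S, C_S, mappings):
    """TERM sketch: return the mapping whose exact 2-means clustering of X,
    restricted to the sample S, best matches the expert clustering
    C_S = (S0, S1). Purely illustrative."""
    def sample_diff(part):
        P0, P1 = (frozenset(p) & S for p in part)
        m = min(len(P0 ^ C_S[0]) + len(P1 ^ C_S[1]),
                len(P0 ^ C_S[1]) + len(P1 ^ C_S[0]))
        return m / (2 * len(S))
    return min(mappings,
               key=lambda f: sample_diff(exact_2means_partition([f(x) for x in X])))
```

For example, with `X = [0.0, 1.0, 2.0, 10.0]` and an expert sample that groups points 1 and 2 together, the identity mapping is preferred over a mapping that folds the outlier back onto the rest of the data.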
Intuitively, this algorithm will work well when the empirical difference and the true difference of the mappings in the class are close to each other. In this case, by minimizing the empirical difference, the algorithm will automatically minimize the true difference as well. In order to formalize this idea, we define the notion of “representativeness” of a sample.
Definition 2.
(Representative Sample) Let $\mathcal{F}$ be a class of mappings from $X$ to $\mathbb{R}^{d'}$. A sample $S \subset X$ is $\epsilon$-representative with respect to $\mathcal{F}$, $X$, and the clustering $C^*$, if for every $f \in \mathcal{F}$ the following holds:

$$\left| \Delta_S(C_f(X), C^*) - \Delta_X(C_f(X), C^*) \right| \leq \epsilon \qquad (10)$$
The following theorem shows that for the TERM algorithm to work, it is sufficient to supply it with a representative sample.
Theorem 1.
(Sufficiency of Uniform Convergence) Let $\mathcal{F}$ be a set of mappings from $X$ to $\mathbb{R}^{d'}$. If $S$ is an $\epsilon$-representative sample with respect to $\mathcal{F}$, $X$, and $C^*$, then

$$\Delta_X(C_{\hat{f}}(X), C^*) \leq \Delta_X(C_{f^*}(X), C^*) + 2\epsilon \qquad (11)$$

where $\hat{f} = \arg\min_{f \in \mathcal{F}} \Delta_S(C_f(X), C^*)$ and $f^* = \arg\min_{f \in \mathcal{F}} \Delta_X(C_f(X), C^*)$.
Proof.
Using the $\epsilon$-representativeness of $S$ (steps (12) and (14)) and the fact that $\hat{f}$ is the empirical minimizer of the loss function (step (13)), we have

$$\Delta_X(C_{\hat{f}}(X), C^*) \leq \Delta_S(C_{\hat{f}}(X), C^*) + \epsilon \qquad (12)$$
$$\leq \Delta_S(C_{f^*}(X), C^*) + \epsilon \qquad (13)$$
$$\leq \left( \Delta_X(C_{f^*}(X), C^*) + \epsilon \right) + \epsilon \qquad (14)$$
$$= \Delta_X(C_{f^*}(X), C^*) + 2\epsilon \qquad (15)$$
∎
Therefore, we just need to provide an upper bound for the sample complexity of uniform convergence: “how many instances do we need to make sure that with high probability our sample is representative?”
3.2 Classes of Mappings with a Uniqueness Property
In general, the solution to $k$-means clustering may not be unique. Therefore, the learner may end up finding a mapping that corresponds to multiple different clusterings. This is not desirable, because in this case the output of the learner will not be interpretable. Therefore, it is reasonable to choose the class of potential mappings so that it includes only mappings under which the solution is unique.
In order to make this idea concrete, we need to define an appropriate notion of uniqueness. We use a notion similar to the one introduced by Balcan et al. (2009), with a slight modification (our notion is additive in both parameters rather than multiplicative).
Definition 3.
(Uniqueness) We say that $k$-means clustering of a domain $X$ under a mapping $f$ has an $(\epsilon, \eta)$-unique solution if every $\eta$-optimal solution of the $k$-means cost is $\epsilon$-close to the optimal solution. Formally, the solution is $(\epsilon, \eta)$-unique if every partition $C$ that satisfies

$$\mathrm{COST}_f(X, C) \leq \mathrm{COST}_f(X, C_f(X)) + \eta \qquad (16)$$

would also satisfy

$$\Delta_X(C, C_f(X)) \leq \epsilon \qquad (17)$$

In the degenerate case where the optimal solution to $k$-means is not unique itself (and so $C_f(X)$ is not well-defined), we say that the solution is not unique.
It can be noted that the definition of uniqueness not only requires the optimal solution of $k$-means clustering to be unique, but also that all the "near-optimal" minimizers of the $k$-means clustering cost be "similar". This is a natural strengthening of the uniqueness condition, to guard against cases where there are $\eta$-optimizers of the cost function (for arbitrarily small $\eta$) with totally different solutions.
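A toy sanity check (our own example) of why uniqueness can fail: on perfectly symmetric data, 2-means has two equally good but totally different optima, so no notion of uniqueness with a small closeness parameter can hold:

```python
from itertools import combinations
import numpy as np

# Four corners of a unit square: 2-means can split left/right or top/bottom
# at identical cost, so the optimum is far from unique.
pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def bipartition_cost(A):
    # For a fixed partition, the optimal centers are the cluster means.
    B = [i for i in range(4) if i not in A]
    return sum(float(((pts[list(g)] - pts[list(g)].mean(axis=0)) ** 2).sum())
               for g in (A, B))

seen = {}
for r in (1, 2):
    for A in combinations(range(4), r):
        key = frozenset((frozenset(A), frozenset(range(4)) - frozenset(A)))
        seen[key] = bipartition_cost(list(A))

best = min(seen.values())
optima = [p for p, c in seen.items() if abs(c - best) < 1e-9]
# Two distinct optimal bipartitions (vertical and horizontal split) tie at cost 1.0.
```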
Now that we have a definition of uniqueness, we can define the set of mappings for $X$ under which the solution is unique. We say that a class of mappings $\mathcal{F}$ has the uniqueness property with respect to $X$ if every mapping in $\mathcal{F}$ has the uniqueness property over $X$.
Note that, given an arbitrary class of mappings $\mathcal{F}$, we can find a subset of it that satisfies the uniqueness property over $X$. Also, as argued above, this subset is the useful subset to work with. Therefore, in the rest of the paper, we investigate learning for classes with the uniqueness property. In the next section, we prove uniform convergence results for such classes.
3.3 Uniform Convergence Results
In Section 3.1, we defined the notion of representative samples. Also, we proved that if a TERM algorithm is fed with such a representative sample, it will work satisfactorily. The most technical part of the proof is then about the question: "how large should the sample be in order to make sure that, with high probability, it is actually a representative sample?"
In order to formalize this notion, let $\mathcal{F}$ be a set of mappings from a domain $X$ to $\mathbb{R}^{d'}$ (in the analysis, for simplicity, we will assume wherever needed that the mappings take values in a bounded subset of $\mathbb{R}^{d'}$). Define the sample complexity of uniform convergence, $m^{UC}_{\mathcal{F}}(\epsilon, \delta)$, as the minimum number $m$ such that, for every fixed partition $C^*$, if $S$ is a randomly (uniformly) selected subset of $X$ of size $m$, then with probability at least $1 - \delta$, for all $f \in \mathcal{F}$ we have

$$\left| \Delta_S(C_f(X), C^*) - \Delta_X(C_f(X), C^*) \right| \leq \epsilon \qquad (18)$$
The technical part of this paper is devoted to providing an upper bound for this sample complexity.
3.3.1 Preliminaries
Definition 4.
($\epsilon$-cover and covering number) Let $\mathcal{F}$ be a set of mappings from $X$ to $\mathbb{R}^{d'}$. A subset $\hat{\mathcal{F}} \subset \mathcal{F}$ is called an $\epsilon$-cover for $\mathcal{F}$ with respect to the metric $d(\cdot, \cdot)$ if for every $f \in \mathcal{F}$ there exists $\hat{f} \in \hat{\mathcal{F}}$ such that $d(f, \hat{f}) \leq \epsilon$. The covering number $N(\epsilon, \mathcal{F}, d)$ is the size of the smallest $\epsilon$-cover of $\mathcal{F}$ with respect to $d$.
In the above definition, we did not specify the metric $d$. In our analysis, we are interested in the distance with respect to the $\ell_\infty$ norm, namely:

$$d_\infty(f_1, f_2) = \max_{x \in X} \| f_1(x) - f_2(x) \| \qquad (19)$$

Note that the mappings we consider are not real-valued functions; rather, their output is a $d'$-dimensional vector. This is in contrast to the usual analysis used for learning real-valued functions. If $f_1$ and $f_2$ are real-valued, then the $\ell_\infty$ distance is defined by

$$d_\infty(f_1, f_2) = \max_{x \in X} | f_1(x) - f_2(x) | \qquad (20)$$
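On a finite domain, the $\ell_\infty$ distance between two mappings is just the largest pointwise gap, and any covering number can be upper-bounded by a greedy construction. A sketch (function names are ours; taking the Euclidean norm between images is our assumption):

```python
import numpy as np

def linf_distance(F1, F2):
    """Max over the domain of ||f1(x) - f2(x)||, for mappings given as
    (n, d') arrays of the images of the n domain points."""
    return float(np.max(np.linalg.norm(F1 - F2, axis=1)))

def greedy_eps_cover(items, dist, eps):
    """Greedily build an eps-cover of `items` under the metric `dist`.

    Every discarded item is within eps of some kept item, so the kept set
    is an eps-cover; its size is therefore an upper bound on the covering
    number N(eps, items, dist)."""
    cover = []
    for x in items:
        if all(dist(x, c) > eps for c in cover):
            cover.append(x)
    return cover
```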
We will prove sample complexity bounds for our problem based on the covering number of the set of mappings. However, it will be beneficial to have a bound based on some notion of capacity, similar to the VC-dimension, as well. This will help in better understanding and easier analysis of the sample complexity of different classes. While the VC-dimension is defined for binary-valued functions, we need a similar notion for functions with outputs in $\mathbb{R}^{d'}$. For real-valued functions, we have such a notion, called the pseudo-dimension (Pollard (1984)).
Definition 5.
(Pseudo-Dimension) Let $\mathcal{H}$ be a set of functions from $X$ to $\mathbb{R}$. Let $S = \{x_1, \ldots, x_m\}$ be a subset of $X$. Then $S$ is pseudo-shattered by $\mathcal{H}$ if there are real numbers $r_1, \ldots, r_m$ such that for every $b \in \{0, 1\}^m$, there is a function $h_b \in \mathcal{H}$ with $\mathrm{sign}(h_b(x_i) - r_i) = b_i$ for $i \in [m]$. The pseudo-dimension of $\mathcal{H}$, denoted $\mathrm{Pdim}(\mathcal{H})$, is the size of the largest pseudo-shattered set.
It can be shown (e.g., Theorem 18.4 in Anthony and Bartlett (2009)) that for a real-valued class $\mathcal{H}$, if $\mathrm{Pdim}(\mathcal{H}) \leq q$ then $\log N(\epsilon, \mathcal{H}, d_\infty) = \tilde{O}(q)$, where $\tilde{O}$ hides logarithmic factors of $\frac{1}{\epsilon}$. In the next sections, we will generalize this notion to $\mathbb{R}^{d'}$-valued functions.
3.3.2 Reduction to Binary Hypothesis Classes
Let $f_1, f_2 \in \mathcal{F}$ be two mappings and $\sigma$ be a permutation over $[k]$. Define the binary-valued function $g_{f_1, f_2, \sigma} : X \to \{0, 1\}$ as follows

$$g_{f_1, f_2, \sigma}(x) = \begin{cases} 0 & \text{if } x \in (C_{f_1}(X))_i \cap (C_{f_2}(X))_{\sigma(i)} \text{ for some } i \in [k] \\ 1 & \text{otherwise} \end{cases} \qquad (21)$$

Let $G^{\sigma}_{\mathcal{F}}$ be the set of all such functions with respect to $\mathcal{F}$ and $\sigma$:

$$G^{\sigma}_{\mathcal{F}} = \left\{ g_{f_1, f_2, \sigma} : f_1, f_2 \in \mathcal{F} \right\} \qquad (22)$$

Finally, let $G_{\mathcal{F}}$ be the union of all $G^{\sigma}_{\mathcal{F}}$ over all choices of $\sigma$. Formally, if $S_k$ is the set of all permutations over $[k]$, then

$$G_{\mathcal{F}} = \bigcup_{\sigma \in S_k} G^{\sigma}_{\mathcal{F}} \qquad (23)$$
For a set $S \subset X$ and a binary function $g$, let $E_S(g) = \frac{1}{|S|} \sum_{x \in S} g(x)$ denote the empirical average of $g$ over $S$ (and let $E_X(g)$ be defined analogously). We now show that a uniform convergence result with respect to $G_{\mathcal{F}}$ is sufficient to have uniform convergence for the difference function. Therefore, we will be able to investigate conditions for uniform convergence of $G_{\mathcal{F}}$ rather than of the difference function.
Theorem 2.
Let $X$ be a domain set, $\mathcal{F}$ be a set of mappings, and $G_{\mathcal{F}}$ be defined as above. If $S \subset X$ is such that

$$\sup_{g \in G_{\mathcal{F}}} \left| E_S(g) - E_X(g) \right| \leq \epsilon \qquad (24)$$

then $S$ will be $\epsilon$-representative with respect to $\mathcal{F}$, i.e., for all $f \in \mathcal{F}$ we will have

$$\left| \Delta_S(C_f(X), C^*) - \Delta_X(C_f(X), C^*) \right| \leq \epsilon \qquad (25)$$
Proof.
Let $f^*$ be a mapping with $C_{f^*}(X) = C^*$ (such a mapping exists by Proposition 1). Then, for every $f \in \mathcal{F}$,

$$\left| \Delta_S(C_f(X), C^*) - \Delta_X(C_f(X), C^*) \right| = \left| \min_{\sigma \in S_k} E_S(g_{f, f^*, \sigma}) - \min_{\sigma \in S_k} E_X(g_{f, f^*, \sigma}) \right| \qquad (26)$$
$$\leq \max_{\sigma \in S_k} \left| E_S(g_{f, f^*, \sigma}) - E_X(g_{f, f^*, \sigma}) \right| \qquad (27)$$
$$\leq \sup_{g \in G_{\mathcal{F}}} \left| E_S(g) - E_X(g) \right| \qquad (28)$$
$$\leq \epsilon \qquad (29)$$
∎
The fact that $G_{\mathcal{F}}$ is a class of binary-valued functions enables us to provide sample complexity bounds based on the VC-dimension of this class. However, providing bounds based on the VC-dimension alone is not sufficient, in the sense that it is not convenient to work with the class $G_{\mathcal{F}}$ directly. Instead, it would be preferable to prove bounds based directly on the capacity of the class of mappings, $\mathcal{F}$. In the next section, we address this issue.
3.3.3 Covering Number and Uniform Convergence
The classes introduced in the previous section, $G^{\sigma}_{\mathcal{F}}$ and $G_{\mathcal{F}}$, are binary hypothesis classes. Also, we have shown that proving a uniform convergence result for $G_{\mathcal{F}}$ is sufficient for our purpose. In this section, we show that a bound on the covering number of $\mathcal{F}$ is sufficient to prove uniform convergence for $G_{\mathcal{F}}$.
In Section 3.2, we argued that we only care about classes that have the uniqueness property. In the rest of this section, assume that $\mathcal{F}$ is a class of mappings from $X$ to $\mathbb{R}^{d'}$ that satisfies the uniqueness property.
Lemma 1.
Let $f_1, f_2 \in \mathcal{F}$. If $d_\infty(f_1, f_2)$ is sufficiently small (as a function of the uniqueness parameters $\epsilon$ and $\eta$), then $\Delta_X(f_1, f_2) \leq 2\epsilon$.
We defer the proof of this lemma to the appendix, and present the next lemma.
Lemma 2.
Let $G_{\mathcal{F}}$ be defined as in the previous section. Then, for a suitable $\epsilon'$ (depending on the uniqueness parameters),

$$N(\epsilon, G_{\mathcal{F}}) \leq k! \; N(\epsilon', \mathcal{F}, d_\infty) \qquad (30)$$
Proof.
Let $\hat{\mathcal{F}}$ be the cover achieving the covering number $N(\epsilon', \mathcal{F}, d_\infty)$. Based on the previous lemma, the class of binary functions induced by $\hat{\mathcal{F}}$ is a cover for $G^{\sigma}_{\mathcal{F}}$, for each fixed $\sigma$. But there are only $k!$ permutations of $[k]$; therefore, the covering number of $G_{\mathcal{F}}$ is at most $k!$ times larger than that of $\mathcal{F}$. This proves the result. ∎
Basically, this means that if we have a small covering number for the mappings, we will have the uniform convergence result we were looking for. The following theorem proves this result.
Theorem 3.
Let $\mathcal{F}$ be a set of mappings with the uniqueness property. Then, for some constant $c$, we have

$$m^{UC}_{\mathcal{F}}(\epsilon, \delta) \leq \frac{c}{\epsilon^2} \left( \log N(\epsilon', \mathcal{F}, d_\infty) + \log k! + \log \frac{1}{\delta} \right) \qquad (31)$$
Proof.
Following the previous lemma, if we have a small covering number for $\mathcal{F}$, we will have a small covering number for $G_{\mathcal{F}}$ as well. But based on standard uniform convergence theory, if a hypothesis class has a small covering number, then it has the uniform convergence property. More precisely (e.g., Theorem 17.1 in Anthony and Bartlett (2009)), we have:

$$m^{UC}_{G_{\mathcal{F}}}(\epsilon, \delta) \leq \frac{c}{\epsilon^2} \left( \log N(\epsilon, G_{\mathcal{F}}) + \log \frac{1}{\delta} \right) \qquad (32)$$
Applying Lemma 2 to the above proves the result. ∎
3.3.4 Bounding Covering Number
In the previous section, we proved that if the covering number of the class of mappings is bounded, then we have uniform convergence. However, it is desirable to have a bound with respect to a combinatorial dimension of the class (rather than the covering number). Therefore, we will generalize the notion of pseudo-dimension to classes of mappings that take values in $\mathbb{R}^{d'}$.
Let $\mathcal{F}$ be a set of mappings from $X$ to $\mathbb{R}^{d'}$. For every mapping $f$, define real-valued functions $f^{(1)}, \ldots, f^{(d')}$ such that $f(x) = (f^{(1)}(x), \ldots, f^{(d')}(x))$. Now let $\mathcal{F}_i = \{ f^{(i)} : f \in \mathcal{F} \}$. This means that $\mathcal{F}_1, \ldots, \mathcal{F}_{d'}$ are classes of real-valued functions. Now we define the pseudo-dimension of $\mathcal{F}$ as follows.

$$\mathrm{Pdim}(\mathcal{F}) = \max_{i \in [d']} \mathrm{Pdim}(\mathcal{F}_i) \qquad (33)$$
Proposition 2.
Let $\mathcal{F}$ be a set of mappings from $X$ to $\mathbb{R}^{d'}$. If $\mathrm{Pdim}(\mathcal{F}) \leq q$ then $\log N(\epsilon, \mathcal{F}, d_\infty) = \tilde{O}(d' q)$, where $\tilde{O}$ hides logarithmic factors.
Proof.
The result follows from the corresponding result, mentioned in the preliminaries section, that bounds the covering number of real-valued functions based on the pseudo-dimension. The reason is that we can create a cover for $\mathcal{F}$ by composing the covers of all the coordinate classes $\mathcal{F}_i$. However, this will at most introduce a factor of $d'$ in the logarithm of the covering number. ∎
Therefore, we can rewrite the result of the previous section in terms of pseudodimension.
Theorem 4.
Let $\mathcal{F}$ be a class of mappings with the uniqueness property. Then

$$m^{UC}_{\mathcal{F}}(\epsilon, \delta) = \tilde{O}\left( \frac{1}{\epsilon^2} \left( d' \, \mathrm{Pdim}(\mathcal{F}) + \log k! + \log \frac{1}{\delta} \right) \right) \qquad (34)$$

where $\tilde{O}$ hides logarithmic factors of $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.
3.4 Sample Complexity of PACSRLK
In Section 3.1, we showed that uniform convergence is sufficient for a TERM algorithm to work. Also, in the previous section, we proved a bound on the sample complexity of uniform convergence. The following theorem, which is the main technical result of this paper, combines these two and provides a sample complexity upper bound for the PAC-SRLK framework.
Theorem 5.
Let $\mathcal{F}$ be a class of mappings with the uniqueness property. Then the sample complexity of learning a representation for $k$-means clustering with respect to $\mathcal{F}$ is upper bounded by

$$m_{\mathcal{F}}(\epsilon, \delta) = \tilde{O}\left( \frac{1}{\epsilon^2} \left( d' \, \mathrm{Pdim}(\mathcal{F}) + \log k! + \log \frac{1}{\delta} \right) \right) \qquad (35)$$

where $\tilde{O}$ hides logarithmic factors of $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.
The proof is done by combining Theorems 1 and 4.
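The combination can be sketched as follows (in our reconstructed notation; the halving of $\epsilon$ is the standard move, and the factor 2 comes from Theorem 1):

```latex
% Draw a sample of size m \ge m^{UC}_{\mathcal{F}}(\epsilon/2, \delta).
% Then, with probability at least 1 - \delta, S is (\epsilon/2)-representative,
% and Theorem 1 applied to the TERM output \hat{f} gives
\Delta_X(C_{\hat f}(X), C^*)
  \;\le\; \Delta_X(C_{f^*}(X), C^*) + 2 \cdot \tfrac{\epsilon}{2}
  \;=\; \min_{f \in \mathcal{F}} \Delta_X(C_f(X), C^*) + \epsilon .
```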
The following result shows an upper bound on the sample complexity of learning linear mappings (or, equivalently, Mahalanobis metrics).
Corollary 1.
Let $\mathcal{F}$ be a set of linear mappings from $\mathbb{R}^d$ to $\mathbb{R}^{d'}$ satisfying the uniqueness property. Then we have

$$m_{\mathcal{F}}(\epsilon, \delta) = \tilde{O}\left( \frac{1}{\epsilon^2} \left( d' d + \log k! + \log \frac{1}{\delta} \right) \right) \qquad (36)$$
Proof.
It is a standard result that the pseudo-dimension of a vector space of real-valued functions is just the dimensionality of the space, in our case $d$ (e.g., Theorem 11.4 in Anthony and Bartlett (2009)). Also, based on our definition of pseudo-dimension for $\mathbb{R}^{d'}$-valued functions, the covering-number bound scales by a factor of $d'$. ∎
4 Conclusions and Open Problems
In this paper we provided a formal statistical framework for learning a representation (i.e., a mapping) for $k$-means clustering based on supervised feedback. The learner, unaware of the target clustering of the domain, is given a clustering of a sample set. The learner's task is then to find a mapping (among a class of mappings) under which the result of $k$-means clustering of the domain is as close as possible to the true clustering. This framework was called PAC-SRLK.
A notion of representativeness was introduced, and it was proved that any ERM-type algorithm that has access to such a sample will work satisfactorily. Then, a technical uniform convergence result was proved to make sure that a large enough sample is (with high probability) representative. This was used to prove an upper bound for the sample complexity of PAC-SRLK based on covering numbers of the set of mappings. Furthermore, a notion of pseudo-dimension for classes of mappings was defined, and the sample complexity was upper bounded based on it.
Note that in the analysis, the notion of uniqueness (similar to that of Balcan et al. (2009)) was used, and it was argued that it is reasonable to require the learner to output a mapping under which the solution is "unique" (because otherwise the output of $k$-means clustering would not be interpretable). Therefore, in the analysis, we assumed that the class of potential mappings has the uniqueness property.
It can be noted that we did not analyze the computational complexity of algorithms for the PAC-SRLK framework. We leave this analysis to future work. We note only that a similar notion of uniqueness, proposed by Balcan et al. (2009), made it possible to solve the $k$-means clustering problem efficiently.
One other observation is that representation learning can be regarded as a special case of metric learning, because for every mapping we can define a distance function that computes the distance in the mapped space. In this light, we can make the problem more general by making the learner find a distance function rather than a mapping. This is more challenging to analyze, because we do not even know a generalization bound for center-based clustering under general distance functions. An open question is to provide such general results.
5 Appendix
Proof of Lemma 1. Let $\mathcal{F}$ be a set of mappings that have the uniqueness property, and let $f_1, f_2 \in \mathcal{F}$. We need to prove that $\Delta_X(f_1, f_2) \leq 2\epsilon$. In order to prove this, note that, due to the triangle inequality, we have
(37) 
Therefore, it will be sufficient to show that each of the terms above is at most $\epsilon$. We start by proving a useful lemma.
Lemma 3.
Let $f_1, f_2 \in \mathcal{F}$, and let $\mu$ be an arbitrary set of centers in $\mathbb{R}^{d'}$. Then
Proof.
(38) 
(39) 
(40) 
(41) 
(42) 
∎
Now we are ready to prove that the first term is at most $\epsilon$. To do so, we only need to show that the corresponding $k$-means cost is within $\eta$ of the optimum; because in that case, due to the uniqueness property of $f_1$, the result will follow. Now, using Lemma 3, we have
(43) 
(44) 
(45) 
(46) 
(47) 
where in the first and the last line we used Lemma 3.
Finally, we need to prove the second inequality. Assume the contrary. Then, based on the uniqueness property of $f_2$, we conclude a corresponding cost inequality. In the following, we prove that this inequality cannot be true, and hence we have a contradiction.
Then, based on the boundedness of the mappings, we have:
(48) 
(49) 
(50) 
(51) 
(52) 
(53) 
(54) 
References
 Ackerman et al. (2010) Ackerman, M., Ben-David, S., and Loker, D. (2010). Towards property-based classification of clustering paradigms. In Advances in Neural Information Processing Systems, pages 10–18.

 Alipanahi et al. (2008) Alipanahi, B., Biggs, M., Ghodsi, A., et al. (2008). Distance metric learning vs. Fisher discriminant analysis. In Proceedings of the 23rd National Conference on Artificial Intelligence, pages 598–603.
 Anthony and Bartlett (2009) Anthony, M. and Bartlett, P. L. (2009). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
 Balcan et al. (2009) Balcan, M.-F., Blum, A., and Gupta, A. (2009). Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1068–1077. Society for Industrial and Applied Mathematics.
 Basu et al. (2002) Basu, S., Banerjee, A., and Mooney, R. (2002). Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002).
 Basu et al. (2004) Basu, S., Bilenko, M., and Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59–68. ACM.
 Basu et al. (2008) Basu, S., Davidson, I., and Wagstaff, K. (2008). Constrained clustering: Advances in algorithms, theory, and applications. CRC Press.
 Ben-David (2007) Ben-David, S. (2007). A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Machine Learning, 66(2-3):243–257.
 Biau et al. (2008) Biau, G., Devroye, L., and Lugosi, G. (2008). On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790.
 Bilenko et al. (2004) Bilenko, M., Basu, S., and Mooney, R. J. (2004). Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, page 11. ACM.
 Blum (2014) Blum, A. (2014). Approximation-stability and perturbation-stability. In Dagstuhl Workshop on Analysis of Algorithms Beyond the Worst Case.

 Demiriz et al. (1999) Demiriz, A., Bennett, K. P., and Embrechts, M. J. (1999). Semi-supervised clustering using genetic algorithms. In Artificial Neural Networks in Engineering (ANNIE-99), pages 809–814.
 Kulis et al. (2009) Kulis, B., Basu, S., Dhillon, I., and Mooney, R. (2009). Semi-supervised graph clustering: a kernel approach. Machine Learning, 74(1):1–22.
 Law et al. (2005) Law, M. H., Topchy, A. P., and Jain, A. K. (2005). Model-based clustering with probabilistic constraints. In SDM. SIAM.
 Maurer and Pontil (2010) Maurer, A. and Pontil, M. (2010). K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846.
 Pollard (1984) Pollard, D. (1984). Convergence of Stochastic Processes. Springer.
 Valiant (1984) Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.
 Von Luxburg and Ben-David (2005) Von Luxburg, U. and Ben-David, S. (2005). Towards a statistical theory of clustering. In PASCAL Workshop on Statistics and Optimization of Clustering, pages 20–26.
 Wagstaff et al. (2001) Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al. (2001). Constrained k-means clustering with background knowledge. In ICML, volume 1, pages 577–584.
 Xing et al. (2002) Xing, E. P., Jordan, M. I., Russell, S., and Ng, A. Y. (2002). Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pages 505–512.