# Representation Learning for Clustering: A Statistical Framework

We address the problem of communicating domain knowledge from a user to the designer of a clustering algorithm. We propose a protocol in which the user provides a clustering of a relatively small random sample of a data set. The algorithm designer then uses that sample to come up with a data representation under which k-means clustering results in a clustering (of the full data set) that is aligned with the user's clustering. We provide a formal statistical model for analyzing the sample complexity of learning a clustering representation with this paradigm. We then introduce a notion of capacity of a class of possible representations, in the spirit of the VC-dimension, showing that classes of representations with finite such dimension can be successfully learned with corresponding sample-size bounds, and end our discussion with an analysis of that dimension for classes of representations induced by linear embeddings.


## 1 Introduction

Clustering can be thought of as the task of automatically dividing a set of objects into “coherent” subsets. This definition is not concrete, but its vagueness allows it to serve as an umbrella term for a wide diversity of algorithmic paradigms. Clustering algorithms are routinely applied in a huge variety of fields.

Given a dataset that needs to be clustered for some application, one can choose among a variety of different clustering algorithms, along with different pre-processing techniques, that are likely to result in dramatically different answers. It is therefore critical to incorporate prior knowledge about the data and the intended semantics of the clustering into the process of picking a clustering algorithm (or, clustering model selection). Regretfully, there seems to be no systematic tool for incorporating domain expertise into clustering model selection, and such decisions are usually made in embarrassingly ad hoc ways. This paper aims to address that critical deficiency with a formal statistical framework.

We approach the challenge by considering a scenario in which the domain expert (i.e., the intended user of the clustering) conveys her domain knowledge by providing a clustering of a small random subset of her data set. For example, consider a big customer service center that wishes to cluster incoming requests into groups to streamline their handling. Since the database of requests is too large to be organized manually, the service center wishes to employ a clustering program. As the clustering designer, we would then ask the service center to pick a random sample of requests, manually cluster them, and show us the resulting grouping of that sample. The clustering tool then uses that sample clustering to pick a clustering method that, when applied to the full data set, will result in a clustering that follows the patterns demonstrated by that sample clustering. We address this paradigm from a statistical machine learning perspective. Aiming to achieve generalization guarantees for such an approach, it is essential to introduce some inductive bias. We do that by restricting the clustering algorithm to a predetermined hypothesis class (a set of concrete clustering algorithms).

In a recent Dagstuhl workshop, Blum (2014) proposed to do that by fixing a clustering algorithm, say k-means, and searching for a metric over the data under which k-means optimization yields a clustering that agrees with the training sample clustering. One should note that, given any domain set X, for any k-partitioning C of X, there exists some distance function over X such that C is the optimal k-means clustering solution (this property is sometimes called k-richness). Consequently, to protect against potential overfitting, the class of potential distance functions should be constrained. In this paper, we provide (apparently the first) concrete formal framework for such a paradigm, as well as a generalization analysis of this approach.

In this work we focus on center-based clustering, an important class of clustering algorithms. In these algorithms, the goal is to find a set of “centers” (or prototypes), and the clusters are the Voronoi cells induced by this set of centers. The objective of such a clustering is to minimize the expected value of some monotonically increasing function of the distances of points to their cluster centers. The k-means clustering objective is arguably the most popular clustering paradigm in this class. Currently, center-based clustering tools lack a vehicle for incorporating domain expertise. Domain knowledge is usually taken into account only through an ad hoc choice of input data representation. Regretfully, it may not be realistic to require the domain expert to translate sufficiently elaborate task-relevant knowledge into hand-crafted features.

As a model for learning representations, we assume that the user-desirable clustering can be approximated by first mapping the sample to some Euclidean (or Hilbert) space and then performing k-means clustering in the mapped space (or equivalently, replacing the input data metric by some kernel and performing center-based clustering with respect to that kernel). The clustering algorithm is then supposed to learn a suitable mapping based on the given sample clustering.

The main question addressed in this work is that of sample complexity: what size of sample, to be clustered by the domain expert, suffices for finding a close-to-optimal mapping (i.e., a mapping that generalizes well to the rest of the data)? Intuitively, this sample complexity depends on the richness of the class of potential mappings that the algorithm chooses from. In standard supervised learning, there are well-established notions of capacity of hypothesis classes (e.g., the VC-dimension) that characterize the sample complexity of learning. This paper aims to provide analogous notions of capacity for clustering.

### 1.1 Previous Work

In practice, there are methods that use some form of supervision for clustering. These methods are sometimes called “semi-supervised clustering” (Basu et al., 2002, 2004; Kulis et al., 2009). The most common way to convey such supervision is through a set of pairwise must/cannot-link constraints on the instances (Wagstaff et al., 2001). A common way of using such information is to change the objective of clustering so that violations of these constraints are penalized (Demiriz et al., 1999; Law et al., 2005; Basu et al., 2008). Another approach, which is closer to ours, keeps the clustering optimization objective fixed and instead searches for a metric that best fits the given constraints. The metric is learned based on some objective function over metrics (Xing et al., 2002; Alipanahi et al., 2008), so that pairs of instances marked must-link are close in the new metric space (and cannot-link pairs are far apart). The two approaches can also be integrated (Bilenko et al., 2004). However, these objective functions are usually rather ad hoc. In particular, it is not clear in what sense they are compatible with the adopted clustering algorithm (such as k-means).

A different approach to the problem of communicating user expertise for the purpose of choosing a clustering tool is discussed in Ackerman et al. (2010). They considered a set of properties, or requirements, for clustering algorithms, and investigated which of those properties hold for various algorithms. The user can then pick the right algorithm based on the requirements that she wants the algorithm to meet. However, to turn such an approach into a practically useful tool, one will need to come up with properties that are relevant to the end user of clustering – a goal that is still far from being reached.

Statistical convergence rates of sample clustering to the optimal clustering, with respect to some data generating probability distribution, play a central role in our analysis. From that perspective, most relevant to our paper are results that provide generalization bounds for k-means clustering.

Ben-David (2007) proposed the first dimension-independent generalization bound for k-means clustering based on compression techniques. Biau et al. (2008) tightened this result by an analysis of Rademacher complexity. Maurer and Pontil (2010) investigated a more general framework, in which generalization bounds for k-means as well as other algorithms can be obtained. It should be noted that these results are about the standard clustering setup (without any supervised feedback), where the data representation is fixed and known to the clustering algorithm.

### 1.2 Contributions

Our first contribution is to provide a statistical framework to analyze the problem of learning a representation for clustering. We assume that the expert has some implicit target clustering of the dataset in mind. The learner, however, is unaware of it, and instead has to select a mapping from a set of potential mappings, under which the result of k-means clustering will be similar to the target partition. An appropriate notion of loss function is introduced to quantify the success of the learner. We then define the analogous notion of PAC-learnability (where PAC stands for the well-known notion of “probably approximately correct”, popularized by Valiant (1984)) for the problem of learning a representation for clustering.

The second contribution of the paper is the introduction of a combinatorial parameter, a specific notion of the capacity of the class of mappings, that determines the sample complexity of the clustering learning task. This combinatorial notion is a multivariate version of the pseudo-dimension of a class of real-valued mappings. We show that there is uniform convergence of empirical losses to the true loss, over any class of embeddings F, at a rate that is determined by the proposed dimension of F. This implies that any empirical risk minimization (ERM) algorithm will successfully learn such a class from sample sizes upper bounded by those rates. Finally, we analyze a particular natural class – the class of linear mappings from R^{d_1} to R^{d_2} – and show that, roughly speaking, a sample size of Õ((k + d_1·d_2 + log(1/δ)) / ε²) is sufficient to guarantee an ε-optimal representation.

The rest of this paper is organized as follows: Section 2 defines the problem setting. In Section 3, we investigate ERM-type algorithms and show that “uniform convergence” is sufficient for them to work; this section also presents the uniform convergence results and the proof of an upper bound for the sample complexity. Finally, we conclude in Section 4 and provide some directions for future work.

## 2 Problem Setting

### 2.1 Preliminaries

Let X be a finite domain set. A k-clustering of X is a partition of X into k subsets. If C is a k-clustering, we denote the subsets of the partition by C_1, …, C_k; therefore we have X = C_1 ∪ … ∪ C_k. Let π_k denote the set of all permutations over [k], where [k] denotes {1, …, k}. The clustering difference between two clusterings, C¹ and C², with respect to X is defined by

 Δ_X(C¹, C²) = min_{σ∈π_k} (1/|X|) ∑_{i=1}^{k} |C¹_i Δ C²_{σ(i)}| (1)

where |·| and Δ denote the cardinality and the symmetric difference of sets, respectively. For a sample S ⊆ X and a clustering C (a partition of X), we define C|_S to be the partition of S induced by C, namely C|_S = (C_1 ∩ S, …, C_k ∩ S). Accordingly, the sample-based difference between two partitions is defined by

 Δ_S(C¹, C²) = Δ_S(C¹|_S, C²|_S) (2)

Let f be a mapping from X to R^n, and μ = (μ_1, …, μ_k) be a vector of k centers in R^n. The clustering defined by (f, μ) is the partition over X induced by the k-Voronoi partition in R^n. Namely,

 C^f(μ) = (C_1, …, C_k), where for all i,
 C_i = {x ∈ X : ‖f(x) − μ_i‖_2 ≤ ‖f(x) − μ_j‖_2 for all j ≠ i}

The k-means cost of clustering X with a set of centers μ and with respect to a mapping f is defined by

 COST_X(f, μ) = (1/|X|) ∑_{x∈X} min_{μ_i∈μ} ‖f(x) − μ_i‖²_2 (3)

The k-means clustering algorithm finds the set of centers that minimizes this cost (we assume that the solution to k-means clustering is unique; we elaborate on this issue in the next sections). In other words,

 μ^f_X = argmin_μ COST_X(f, μ) (4)

Also, for a partition C = (C_1, …, C_k) of X and a mapping f, we can define the cost of clustering as follows.

 COST_X(f, C) = (1/|X|) ∑_{i∈[k]} min_{μ_j} ∑_{x∈C_i} ‖f(x) − μ_j‖²_2 (5)

For a mapping f as above, let C^f_X denote the k-means clustering of X induced by f, namely

 C^f_X = C^f(μ^f_X) (6)

The difference between two mappings f_1 and f_2 with respect to X is defined by the difference between the results of k-means clustering using these mappings. Formally,

 Δ_X(f_1, f_2) = Δ_X(C^{f_1}_X, C^{f_2}_X) (7)
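The quantities in Eqs. (3) and the Voronoi partition C^f(μ) translate directly into code. This is a minimal NumPy sketch with names of our own choosing; `fX` holds the mapped points f(x) row by row:

```python
import numpy as np

def kmeans_cost(fX, centers):
    """COST of Eq. (3): average squared Euclidean distance of each
    mapped point f(x) to its nearest center."""
    d2 = ((fX[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean()

def induced_clustering(fX, centers):
    """The clustering C^f(mu): assign each point to its nearest
    center, i.e., the k-Voronoi partition in the mapped space."""
    d2 = ((fX[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

fX = np.array([[0.0], [1.0], [9.0], [10.0]])   # image of X under some f
mu = np.array([[0.5], [9.5]])                  # two candidate centers
```

Eq. (6) would then correspond to evaluating `induced_clustering` at the cost-minimizing centers μ^f_X.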

The following proposition shows the “k-richness” property of the k-means objective.

###### Proposition 1.

Let X be a domain set. For every k-clustering C of X and every n ≥ k, there exists a mapping f : X → R^n such that C^f_X = C.

###### Proof.

The mapping f can be picked so that it collapses each cluster of C into a single point in R^n (so that the image of X under f consists of just k distinct points). The result of k-means clustering under such a mapping will be C. ∎
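The collapsing construction in the proof can be checked numerically. In this sketch (our own encoding), f sends every point of target cluster i to the i-th standard basis vector of R^k; centers placed at those basis vectors achieve cost zero, so they are optimal, and the induced Voronoi partition is exactly the target:

```python
import numpy as np

def collapse_map(labels, k):
    """The mapping from the proof of Proposition 1: send every point
    of target cluster i to the i-th standard basis vector of R^k."""
    return np.eye(k)[labels]

target = np.array([0, 0, 1, 2, 1])     # a 3-clustering of a 5-point domain
fX = collapse_map(target, 3)
centers = np.eye(3)                    # one center per collapsed cluster

d2 = ((fX[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
cost = d2.min(axis=1).mean()           # 0: every point sits on a center,
recovered = d2.argmin(axis=1)          # so the induced partition is the target
```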

In this paper, we investigate the transductive setup, in which there is a given data set, known to the learner, that needs to be clustered. Clustering is often also studied over a data generating probability distribution (e.g., Von Luxburg and Ben-David (2005)). The current work can be readily extended to that setting; in that case, we assume that the clustering algorithm gets, on top of the clustered sample, a large unclustered sample drawn from the data generating distribution.

### 2.2 Formal Problem Statement

Let C* be the target k-clustering of X. A (supervised) representation learner for clustering takes as input a sample S ⊆ X and its clustering, C*|_S, and outputs a mapping f from a set of potential mappings F. In the following, PAC stands for the notion of “probably approximately correct”.

###### Definition 1.

PAC Supervised Representation Learner for K-Means (PAC-SRLK)

Let F be a set of mappings from X to R^n. A representation learning algorithm A is a PAC-SRLK with sample complexity m_F(ε, δ) with respect to F, if for every ε, δ ∈ (0, 1), every domain set X and every clustering C* of X, the following holds:

if S is a randomly (uniformly) selected subset of X of size at least m_F(ε, δ), then with probability at least 1 − δ

 Δ_X(C*, C^{f_A}_X) ≤ inf_{f∈F} Δ_X(C*, C^f_X) + ε (8)

where f_A = A(S, C*|_S) is the output of the algorithm.

This can be regarded as a formal PAC framework for analyzing the problem of learning a representation for k-means clustering. The learner is compared to the best mapping in the class F.

A natural question is to provide bounds on the sample complexity of PAC-SRLK with respect to F. Intuitively, for richer classes of mappings, we need larger clustered samples. Therefore, we need to introduce an appropriate notion of “capacity” of F and bound the sample complexity based on it. This is addressed in the next sections.

## 3 Analysis and Results

### 3.1 Empirical Risk Minimization

In order to prove an upper bound on the sample complexity of representation learning for clustering, we need to consider a concrete algorithm and prove a sample complexity bound for it. Here, we show that any ERM-type algorithm can serve this purpose, which will allow us to prove an upper bound on the sample complexity of PAC-SRLK.

Let F be a class of mappings and X be the domain set. A TERM (Transductive Empirical Risk Minimizer) learner for F takes as input a sample S ⊆ X and its clustering Y = C*|_S and outputs:

 A_TERM(S, Y) = argmin_{f∈F} Δ_S(C^f_X|_S, Y) (9)

Note that we call it transductive because it is implicitly assumed to have access to the unlabeled dataset (i.e., X). A TERM algorithm goes over all mappings in F and selects the mapping that is most consistent with the given clustering: the mapping under which, if we perform k-means clustering of X, the sample-based difference between the result and Y is minimized.

Note that we are not studying this algorithm as a computational tool; we only use it to show an upper bound for the sample complexity.
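For intuition, Eq. (9) can be sketched by brute force over a small finite class of mappings. All names below are ours; `lloyd` is a plain Lloyd heuristic standing in for the exact k-means minimizer that the text assumes, and the sample loss is the matching-based difference of Eq. (1) restricted to S:

```python
import itertools
import numpy as np

def lloyd(fX, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-first seeding
    (a stand-in for the exact k-means minimizer)."""
    centers = [fX[0]]
    for _ in range(k - 1):
        d2 = ((fX[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(fX[d2.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = ((fX[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):                      # recompute cluster means
            if (labels == j).any():
                centers[j] = fX[labels == j].mean(0)
    return labels

def sample_diff(a, b, S, k):
    """Empirical difference between labelings a and b on sample indices S,
    minimized over the k! matchings of cluster labels (cf. Eq. (1))."""
    return min(2 * sum(a[i] != sigma[b[i]] for i in S) / len(S)
               for sigma in itertools.permutations(range(k)))

def term(X, S, y, mappings, k):
    """Eq. (9): return the mapping whose k-means clustering of the FULL
    data set X agrees best with the expert labels y on the sample S."""
    return min(mappings, key=lambda f: sample_diff(y, lloyd(f(X), k), S, k))
```

On data whose first coordinate carries the cluster structure and whose second coordinate is noise, a TERM learner given expert labels that follow the first coordinate should prefer the projection onto that coordinate over the projection onto the noise.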

Intuitively, this algorithm will work well when the sample-based difference and the true difference of the mappings in the class are close to each other. In this case, by minimizing the empirical difference, the algorithm automatically minimizes the true difference as well. In order to formalize this idea, we define the notion of “representativeness” of a sample.

###### Definition 2.

(ε-Representative Sample) Let F be a class of mappings from X to R^n. A sample S ⊆ X is called ε-representative with respect to X, F, and the clustering C*, if for every f ∈ F the following holds

 |Δ_X(C*, C^f_X) − Δ_S(C*, C^f_X)| ≤ ε (10)

The following theorem shows that for the TERM algorithm to work, it is sufficient to supply it with a representative sample.

###### Theorem 1.

(Sufficiency of Uniform Convergence) Let F be a set of mappings from X to R^n. If S is an (ε/2)-representative sample with respect to X, F, and C*, then

 Δ_X(C*, C^{f̂}_X) ≤ Δ_X(C*, C^{f*}_X) + ε (11)

where f̂ = A_TERM(S, C*|_S) and f* = argmin_{f∈F} Δ_X(C*, C^f_X).

###### Proof.

Using the (ε/2)-representativeness of S and the fact that f̂ is the empirical minimizer of the loss function, we have

 Δ_X(C*, C^{f̂}_X) ≤ Δ_S(C*, C^{f̂}_X) + ε/2 (12)
 ≤ Δ_S(C*, C^{f*}_X) + ε/2 (13)
 ≤ Δ_X(C*, C^{f*}_X) + ε/2 + ε/2 (14)
 = Δ_X(C*, C^{f*}_X) + ε (15)
∎

Therefore, we just need to provide an upper bound for the sample complexity of uniform convergence: “how many instances do we need to make sure that, with high probability, our sample is ε-representative?”

### 3.2 Classes of Mappings with a Uniqueness Property

In general, the solution to k-means clustering may not be unique, so the learner may end up finding a mapping that corresponds to multiple different clusterings. This is not desirable, because in this case the output of the learner will not be interpretable. Therefore, it is reasonable to choose the class of potential mappings so that it includes only mappings under which the solution is unique.

In order to make this idea concrete, we need an appropriate notion of uniqueness. We use a notion similar to the one introduced by Balcan et al. (2009), with a slight modification: our notion is additive in both parameters rather than multiplicative.

###### Definition 3.

((ε, η)-Uniqueness) We say that k-means clustering of domain X under mapping f has an (ε, η)-unique solution, if every η-optimal solution of the k-means cost is ε-close to the optimal solution. Formally, the solution is (ε, η)-unique if every partition P that satisfies

 COST_X(f, P) < COST_X(f, C^f_X) + η (16)

also satisfies

 Δ_X(C^f_X, P) < ε (17)

In the degenerate case where the optimal solution to k-means is not unique itself (so that C^f_X is not well-defined), we say that the solution is not (ε, η)-unique.

It can be noted that the definition of (ε, η)-uniqueness not only requires the optimal solution to k-means clustering to be unique, but also requires all the “near-optimal” minimizers of the k-means cost to be “similar”. This is a natural strengthening of the uniqueness condition, guarding against cases where there are η-optimizers of the cost function (for arbitrarily small η) with totally different solutions.

Now that we have a definition of uniqueness, we can define the set of mappings for X under which the solution is unique. We say that a class of mappings F has the (ε, η)-uniqueness property with respect to X if every mapping in F has an (ε, η)-unique solution over X.

Note that given an arbitrary class of mappings F, we can find a subset of it that satisfies the (ε, η)-uniqueness property over X. As argued above, this subset is the useful subset to work with. Therefore, in the rest of the paper, we investigate learning for classes with the (ε, η)-uniqueness property. In the next section, we prove uniform convergence results for such classes.

### 3.3 Uniform Convergence Results

In Section 3.1, we defined the notion of ε-representative samples, and proved that if a TERM algorithm is fed with such a representative sample, it works satisfactorily. The most technical part of the proof then concerns the question: “how large should the sample be in order to make sure that, with high probability, it is actually a representative sample?”

In order to formalize this notion, let F be a set of mappings from a domain X to R^n (for simplicity, we will assume wherever needed that the mappings take values in a bounded subset of R^n). Define the sample complexity of uniform convergence, m^UC_F(ε, δ), as the minimum number m such that for every fixed partition C*, if S is a randomly (uniformly) selected subset of X of size m, then with probability at least 1 − δ, for all f ∈ F we have

 |Δ_X(C*, C^f_X) − Δ_S(C*, C^f_X)| ≤ ε (18)

The technical part of this paper is devoted to providing an upper bound for this sample complexity.

#### 3.3.1 Preliminaries

###### Definition 4.

(ε-cover and covering number) Let F be a set of mappings from X to R^n. A subset F̂ ⊆ F is called an ε-cover for F with respect to a metric d if for every f ∈ F there exists f̂ ∈ F̂ such that d(f, f̂) ≤ ε. The covering number N(F, d, ε) is the size of the smallest ε-cover of F with respect to d.

In the above definition, we did not specify the metric d. In our analysis, we are interested in the L1 distance with respect to X, namely:

 d^X_{L1}(f_1, f_2) = (1/|X|) ∑_{x∈X} ‖f_1(x) − f_2(x)‖_2 (19)

Note that the mappings we consider are not real-valued functions; their output is an n-dimensional vector. This is in contrast to the usual analysis used for learning real-valued functions. If f_1 and f_2 are real-valued, then the L1 distance is defined by

 d^X_{L1}(f_1, f_2) = (1/|X|) ∑_{x∈X} |f_1(x) − f_2(x)| (20)
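For R^n-valued mappings, Eq. (19) is just the average Euclidean distance between the two images of X. A two-line sketch (names ours), with the mapped points stored row by row:

```python
import numpy as np

def d_l1(f1X, f2X):
    """The L1 distance of Eq. (19): average over x in X of the
    Euclidean norm ||f1(x) - f2(x)||_2."""
    return float(np.linalg.norm(f1X - f2X, axis=1).mean())
```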

We will prove sample complexity bounds for our problem based on the L1-covering number of the set of mappings. However, it will be beneficial to also have a bound based on some notion of capacity, similar to the VC-dimension; this helps in better understanding and easier analysis of the sample complexity of different classes. While the VC-dimension is defined for binary-valued functions, we need a similar notion for functions with outputs in R^n. For real-valued functions, such a notion exists, called the pseudo-dimension (Pollard (1984)).

###### Definition 5.

(Pseudo-Dimension) Let F be a set of functions from X to R. Let S = {x_1, …, x_m} be a subset of X. Then S is pseudo-shattered by F if there are real numbers r_1, …, r_m such that for every b ∈ {0, 1}^m, there is a function f_b ∈ F with sign(f_b(x_i) − r_i) = b_i for all i ∈ [m]. The pseudo-dimension of F, denoted Pdim(F), is the size of the largest shattered set.

It can be shown (e.g., Theorem 18.4 in Anthony and Bartlett (2009)) that for a real-valued class F, if Pdim(F) = d then log N(F, d^X_{L1}, ε) ≤ Õ(d), where Õ hides logarithmic factors of 1/ε. In the next sections, we will generalize this notion to R^n-valued functions.

#### 3.3.2 Reduction to Binary Hypothesis Classes

Let f_1, f_2 ∈ F be two mappings and σ be a permutation over [k]. Define the binary-valued function h^σ_{f_1,f_2} : X → {0, 1} as follows

 h^σ_{f_1,f_2}(x) = 1 if x ∈ ∪_{i=1}^{k} (C^{f_1}_i Δ C^{f_2}_{σ(i)}), and 0 otherwise (21)

(where C^{f_1} and C^{f_2} are shorthand for C^{f_1}_X and C^{f_2}_X). Let H^σ_F be the set of all such functions with respect to F and σ:

 H^σ_F = {h^σ_{f_1,f_2}(·) : f_1, f_2 ∈ F} (22)

Finally, let H_F be the union of all H^σ_F over all choices of σ. Formally, if π_k is the set of all permutations over [k], then

 H_F = ∪_{σ∈π_k} H^σ_F (23)

For a set T ⊆ X and a binary function h, let h(T) = (1/|T|) ∑_{x∈T} h(x). We now show that a uniform convergence result with respect to H_F is sufficient to have uniform convergence for the Δ-difference function. Therefore, we will be able to investigate conditions for uniform convergence of H_F rather than the Δ-difference function.

###### Theorem 2.

Let X be a domain set, F be a set of mappings, and H_F be defined as above. If S ⊆ X is such that

 ∀h ∈ H_F, |h(S) − h(X)| ≤ ε (24)

then S will be ε-representative with respect to F, i.e., for all f_1, f_2 ∈ F we will have

 |Δ_X(C^{f_1}_X, C^{f_2}_X) − Δ_S(C^{f_1}_X, C^{f_2}_X)| ≤ ε (25)

###### Proof.

 |Δ_S(C^{f_1}_X, C^{f_2}_X) − Δ_X(C^{f_1}_X, C^{f_2}_X)| (26)
 = |(min_σ (1/|S|) ∑_{x∈S} h^σ_{f_1,f_2}(x)) − (min_σ (1/|X|) ∑_{x∈X} h^σ_{f_1,f_2}(x))| (27)
 ≤ max_σ |(1/|S|) ∑_{x∈S} h^σ_{f_1,f_2}(x) − (1/|X|) ∑_{x∈X} h^σ_{f_1,f_2}(x)| (28)
 = max_σ |h^σ_{f_1,f_2}(S) − h^σ_{f_1,f_2}(X)| ≤ ε (29)
∎

The fact that H_F is a class of binary-valued functions enables us to provide sample complexity bounds based on the VC-dimension of this class. However, providing bounds based on the VC-dimension is not sufficient, in the sense that it is not convenient to work with the class H_F. Instead, it would be preferable to prove bounds directly based on the capacity of the class of mappings, F. In the next section, we address this issue.

#### 3.3.3 L1-Covering Number and Uniform Convergence

The classes introduced in the previous section, H^σ_F and H_F, are binary hypothesis classes. Also, we have shown that proving a uniform convergence result for H_F is sufficient for our purpose. In this section, we show that a bound on the covering number of F is sufficient to prove uniform convergence for H_F.

In Section 3.2, we argued that we only care about classes that have the (ε, η)-uniqueness property. In the rest of this section, assume that F is a class of mappings from X to R^n that satisfies the (ε, η)-uniqueness property.

###### Lemma 1.

Let f_1, f_2 ∈ F. If d^X_{L1}(f_1, f_2) ≤ η/12 then Δ_X(f_1, f_2) ≤ 2ε.

We leave the proof of this lemma for the appendix, and present the next lemma.

###### Lemma 2.

Let H_F be defined as in the previous section. Then,

 N(H_F, d^X_{L1}, 2ε) ≤ k! · N(F, d^X_{L1}, η/12) (30)

###### Proof.

Let F̂ be the (η/12)-cover corresponding to the covering number N(F, d^X_{L1}, η/12). Based on the previous lemma, H^σ_{F̂} is a 2ε-cover for H^σ_F. But there are only k! permutations σ, therefore the covering number for H_F is at most k! times larger than that of a single H^σ_F. This proves the result. ∎

Basically, this means that if we have a small covering number for the class of mappings, we have the uniform convergence result we were looking for. The following theorem proves this.

###### Theorem 3.

Let F be a set of mappings with the (ε, η)-uniqueness property. Then for some constant α we have

 m^UC_F(ε, δ) ≤ O((log k! + log N(F, d^X_{L1}, η/α) + log(1/δ)) / ε²) (31)

###### Proof.

Following the previous lemma, if we have a small L1-covering number for F, we will have a small covering number for H_F as well. But based on standard uniform convergence theory, if a hypothesis class has a small covering number, then it has the uniform convergence property. More precisely (e.g., Theorem 17.1 in Anthony and Bartlett (2009)), we have:

 m^UC_{H_F}(ε_0, δ) ≤ O((log N(H_F, d^X_{L1}, ε_0/16) + log(1/δ)) / ε_0²) (32)

Applying Lemma 2 to the above proves the result. ∎

#### 3.3.4 Bounding L1-Covering Number

In the previous section, we proved that if the covering number of the class of mappings is bounded, then we have uniform convergence. However, it is desirable to have a bound with respect to a combinatorial dimension of the class (rather than the covering number). Therefore, we generalize the notion of pseudo-dimension to classes of mappings that take values in R^n.

Let F be a set of mappings from X to R^n. For every mapping f ∈ F, define real-valued functions f^1, …, f^n such that f(x) = (f^1(x), …, f^n(x)). Now let F_i = {f^i : f ∈ F}. This means that F_1, …, F_n are classes of real-valued functions. We define the pseudo-dimension of F as follows.

 Pdim(F) = n · max_{i∈[n]} Pdim(F_i) (33)
###### Proposition 2.

Let F be a set of mappings from X to R^n. If Pdim(F) = d then log N(F, d^X_{L1}, ε) ≤ Õ(d), where Õ hides logarithmic factors.

###### Proof.

The result follows from the corresponding bound on the covering number of real-valued function classes based on the pseudo-dimension, mentioned in the preliminaries section. The reason is that we can create a cover for F by composing the ε-covers of all the classes F_i. This introduces at most a factor of n in the logarithm of the covering number. ∎

Therefore, we can rewrite the result of the previous section in terms of pseudo-dimension.

###### Theorem 4.

Let F be a class of mappings with the (ε, η)-uniqueness property. Then

 m^UC_F(ε, δ) ≤ Õ((k + Pdim(F) + log(1/δ)) / ε²) (34)

where Õ hides logarithmic factors of k and 1/η.

### 3.4 Sample Complexity of PAC-SRLK

In Section 3.1, we showed that uniform convergence is sufficient for a TERM algorithm to work. Also, in the previous section, we proved a bound on the sample complexity of uniform convergence. The following theorem, which is the main technical result of this paper, combines the two and provides a sample complexity upper bound for the PAC-SRLK framework.

###### Theorem 5.

Let F be a class of mappings with the (ε, η)-uniqueness property. Then the sample complexity of learning a representation for k-means clustering with respect to F is upper bounded by

 m_F(ε, δ) ≤ Õ((k + Pdim(F) + log(1/δ)) / ε²) (35)

where Õ hides logarithmic factors of k and 1/η.

The proof is done by combining Theorems 1 and 4.

The following result shows an upper bound for the sample complexity of learning linear mappings (or equivalently, Mahalanobis metrics).

###### Corollary 1.

Let F be a set of (ε, η)-unique linear mappings from R^{d_1} to R^{d_2}. Then we have

 m_F(ε, δ) ≤ Õ((k + d_1·d_2 + log(1/δ)) / ε²) (36)

###### Proof.

It is a standard result that the pseudo-dimension of a vector space of real-valued functions is the dimensionality of that space, here d_1 (e.g., Theorem 11.4 in Anthony and Bartlett (2009)). Based on our definition of Pdim for R^n-valued functions, this scales by a factor of d_2. ∎
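The equivalence between linear mappings and Mahalanobis metrics mentioned above rests on the identity ‖Ax − Ay‖² = (x − y)ᵀAᵀA(x − y): clustering the image under A with the Euclidean metric is the same as clustering the original data under the Mahalanobis (semi-)metric with matrix AᵀA. A quick numerical check of this identity (the particular matrix and points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))        # a linear mapping R^3 -> R^2
M = A.T @ A                        # the induced Mahalanobis matrix
x, y = rng.normal(size=3), rng.normal(size=3)

dist_mapped = np.sum((A @ x - A @ y) ** 2)   # ||Ax - Ay||^2 in R^2
dist_mahal = (x - y) @ M @ (x - y)           # Mahalanobis form in R^3
```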

## 4 Conclusions and Open Problems

In this paper we provided a formal statistical framework for learning a representation (i.e., a mapping) for k-means clustering based on supervised feedback. The learner, unaware of the target clustering of the domain, is given a clustering of a sample set. The learner’s task is then to find a mapping (from a class of potential mappings) under which the result of k-means clustering of the domain is as close as possible to the target clustering. We called this framework PAC-SRLK.

A notion of ε-representativeness was introduced, and it was proved that any ERM-type algorithm that has access to such a sample will work satisfactorily. A technical uniform convergence result was then proved to ensure that a large enough sample is (with high probability) ε-representative. This was used to prove an upper bound for the sample complexity of PAC-SRLK based on covering numbers of the set of mappings. Furthermore, a notion of pseudo-dimension for classes of mappings was defined, and the sample complexity was upper bounded based on it.

Note that in the analysis, the notion of (ε, η)-uniqueness (similar to that of Balcan et al. (2009)) was used, and it was argued that it is reasonable to require the learner to output a mapping under which the solution is “unique” (because otherwise the output of k-means clustering would not be interpretable). Therefore, in the analysis, we assumed that the class of potential mappings has the (ε, η)-uniqueness property.

We did not analyze the computational complexity of algorithms for the PAC-SRLK framework; we leave this analysis to future work. We note, however, that for a similar notion of uniqueness, Balcan et al. (2009) showed that the k-means clustering problem can be solved efficiently.

One other observation is that representation learning can be regarded as a special case of metric learning, because every mapping defines a distance function that computes distances in the mapped space. In this light, we can make the problem more general by having the learner find a distance function rather than a mapping. This is more challenging to analyze, because we do not even have a generalization bound for center-based clustering under general distance functions. Providing such general results remains an open question.

## 5 Appendix

Proof of Lemma 1. Let be a set of mappings that have -uniqueness property. Let and . We need to prove that . In order to prove this, note that due to triangular inequality, we have

 ΔX(f1,f2)=ΔX(Cf1(μf1),Cf2(μf2))≤ΔX(Cf1(μf1),Cf1(μf2))+ΔX(Cf1(μf2),Cf2(μf2)) (37)

Therefore, it suffices to show that each of the two Δ-terms above is at most ε. We start by proving a useful lemma.

###### Lemma 3.

Let f_1, f_2 ∈ F with sup_{x∈X} ‖f_1(x) − f_2(x)‖ ≤ η/6, and let μ = {μ_1, …, μ_k} be an arbitrary set of k centers in the (bounded) image space. Then

 |COST_X(f_1, μ) − COST_X(f_2, μ)| < η/2
###### Proof.
 |COST_X(f_1, μ) − COST_X(f_2, μ)| = |(1/|X|) ∑_{x∈X} min_{μ_j∈μ} ‖f_1(x) − μ_j‖² − (1/|X|) ∑_{x∈X} min_{μ_j∈μ} ‖f_2(x) − μ_j‖²| (38)
 ≤ (1/|X|) ∑_{x∈X} max_{μ_j∈μ} |‖f_1(x) − μ_j‖² − ‖f_2(x) − μ_j‖²| (39)
 = (1/|X|) ∑_{x∈X} max_{μ_j∈μ} |‖f_1(x)‖² − ‖f_2(x)‖² − 2⟨μ_j, f_1(x) − f_2(x)⟩| (40)
 = (1/|X|) ∑_{x∈X} max_{μ_j∈μ} |⟨f_1(x) + f_2(x) − 2μ_j, f_1(x) − f_2(x)⟩| (41)
 ≤ (3/|X|) ∑_{x∈X} ‖f_1(x) − f_2(x)‖ ≤ 3 · (η/6) = η/2 (42)

where the last line uses the boundedness of the mappings and the centers. ∎
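Lemma 3 admits a quick numerical sanity check (an illustration, not part of the proof). The snippet below draws bounded images for the two mappings with sup-distance at most η/6, draws bounded centers, and compares the two empirical costs; all constants here are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, n, dim, k = 0.6, 200, 5, 3

# Images of the n points under two mappings, with ||f1(x)|| = 1/2 and
# sup_x ||f1(x) - f2(x)|| <= eta/6 (the hypotheses of Lemma 3).
F1 = rng.normal(size=(n, dim))
F1 *= 0.5 / np.linalg.norm(F1, axis=1, keepdims=True)
pert = rng.normal(size=(n, dim))
pert *= (eta / 6) / np.linalg.norm(pert, axis=1, keepdims=True)
F2 = F1 + rng.uniform(0.0, 1.0, size=(n, 1)) * pert

# Bounded centers.
mu = rng.normal(size=(k, dim))
mu *= 0.5 / np.linalg.norm(mu, axis=1, keepdims=True)

def cost(F, mu):
    """Empirical k-means cost of a fixed center set on mapped points."""
    return float(np.mean(np.min(((F[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)))

gap = abs(cost(F1, mu) - cost(F2, mu))  # Lemma 3 predicts gap < eta/2
```

In practice the observed gap is far below the η/2 guarantee, as the bound is a worst case over all center placements.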

Now we are ready to prove that the first Δ-term is at most ε, i.e., Δ_X(C_{f_1}(μ_{f_1}), C_{f_1}(μ_{f_2})) ≤ ε. To do so, it is enough to show that COST_X(f_1, μ_{f_2}) ≤ COST_X(f_1, μ_{f_1}) + η; in that case the result follows from the η-uniqueness property of f_1. Using Lemma 3, we have

 COST_X(f_1, μ_{f_2}) − COST_X(f_1, μ_{f_1}) (43)
 ≤ (COST_X(f_2, μ_{f_2}) + η/2) − COST_X(f_1, μ_{f_1}) (44)
 = min_μ COST_X(f_2, μ) − min_μ COST_X(f_1, μ) + η/2 (45)
 ≤ max_μ (COST_X(f_2, μ) − COST_X(f_1, μ)) + η/2 (46)
 ≤ η/2 + η/2 = η (47)

where the first and the last inequalities use Lemma 3.

Finally, we need to prove the second Δ-inequality, i.e., Δ_X(C_{f_1}(μ_{f_2}), C_{f_2}(μ_{f_2})) ≤ ε. Assume the contrary. Then, by the η-uniqueness property of f_2, we conclude that COST_X(f_2, C_{f_1}(μ_{f_2})) > COST_X(f_2, C_{f_2}(μ_{f_2})) + η. In the following, we prove that this cannot be true, and hence a contradiction.

Let m_x = argmin_{μ_j ∈ μ_{f_2}} ‖f_1(x) − μ_j‖ be the center to which x is assigned under f_1. Then, based on the boundedness of the mappings and the centers, and the assumption sup_{x∈X} ‖f_1(x) − f_2(x)‖ ≤ η/12, we have:

 COST_X(f_2, C_{f_1}(μ_{f_2})) − COST_X(f_2, C_{f_2}(μ_{f_2})) (48)
 = (1/|X|) ∑_{x∈X} ‖f_2(x) − m_x‖² − COST_X(f_2, μ_{f_2}) (49)
 = (1/|X|) ∑_{x∈X} ‖f_2(x) − f_1(x) + f_1(x) − m_x‖² − COST_X(f_2, μ_{f_2}) (50)
 = (1/|X|) ∑_{x∈X} ‖f_2(x) − f_1(x)‖² + (1/|X|) ∑_{x∈X} ‖f_1(x) − m_x‖² + (2/|X|) ∑_{x∈X} ⟨f_2(x) − f_1(x), f_1(x) − m_x⟩ − COST_X(f_2, μ_{f_2}) (51)
 ≤ (2/|X|) ∑_{x∈X} ‖f_2(x) − f_1(x)‖ + COST_X(f_1, μ_{f_2}) + (4/|X|) ∑_{x∈X} ‖f_2(x) − f_1(x)‖ − COST_X(f_2, μ_{f_2}) (52)
 ≤ (6/|X|) ∑_{x∈X} ‖f_2(x) − f_1(x)‖ + (COST_X(f_1, μ_{f_2}) − COST_X(f_2, μ_{f_2})) (53)
 ≤ 6 · (η/12) + η/2 ≤ η (54)

where (52) uses boundedness together with (1/|X|) ∑_{x∈X} ‖f_1(x) − m_x‖² = COST_X(f_1, μ_{f_2}), and (54) uses Lemma 3. This contradicts the inequality above, which completes the proof. ∎
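The conclusion of the lemma can also be illustrated on a toy instance (a check, not part of the proof): for two representations of six points that are pointwise close relative to the cluster separation, brute-force enumeration finds the same optimal 2-means partition under both. The data below are hypothetical:

```python
import numpy as np
from itertools import combinations

# Two well-separated groups, and a pointwise perturbation far smaller
# than the separation between them.
F1 = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
               [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
F2 = F1 + 0.01 * np.array([[1, -1], [-1, 1], [1, 1], [-1, 0], [0, 1], [1, 0]])

def best_partition(F):
    """Exact optimal 2-means clustering by enumerating all bipartitions."""
    n = len(F)
    best, best_cost = None, np.inf
    for r in range(1, n // 2 + 1):
        for left in combinations(range(n), r):
            mask = np.zeros(n, dtype=bool)
            mask[list(left)] = True
            c = sum(((F[m] - F[m].mean(axis=0)) ** 2).sum() for m in (mask, ~mask))
            if c < best_cost:
                best_cost, best = c, frozenset(left)
    return best

same = best_partition(F1) == best_partition(F2)  # the two clusterings coincide
```

Exhaustive search stands in for k-means here only because the instance is tiny; it makes the comparison of the two optimal clusterings exact.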

## References

• Ackerman et al. (2010) Ackerman, M., Ben-David, S., and Loker, D. (2010). Towards property-based classification of clustering paradigms. In Advances in Neural Information Processing Systems, pages 10–18.
• Alipanahi et al. (2008) Alipanahi, B., Biggs, M., Ghodsi, A., et al. (2008). Distance metric learning vs. Fisher discriminant analysis. In Proceedings of the 23rd National Conference on Artificial Intelligence, pages 598–603.
• Anthony and Bartlett (2009) Anthony, M. and Bartlett, P. L. (2009). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
• Balcan et al. (2009) Balcan, M.-F., Blum, A., and Gupta, A. (2009). Approximate clustering without the approximation. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1068–1077. Society for Industrial and Applied Mathematics.
• Basu et al. (2002) Basu, S., Banerjee, A., and Mooney, R. (2002). Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002).
• Basu et al. (2004) Basu, S., Bilenko, M., and Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 59–68. ACM.
• Basu et al. (2008) Basu, S., Davidson, I., and Wagstaff, K. (2008). Constrained clustering: Advances in algorithms, theory, and applications. CRC Press.
• Ben-David (2007) Ben-David, S. (2007). A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Machine Learning, 66(2-3):243–257.
• Biau et al. (2008) Biau, G., Devroye, L., and Lugosi, G. (2008). On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790.
• Bilenko et al. (2004) Bilenko, M., Basu, S., and Mooney, R. J. (2004). Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the twenty-first international conference on Machine learning, page 11. ACM.
• Blum (2014) Blum, A. (2014). Approximation-stability and perturbation-stability. In DAGSTUHL Workshop on Analysis of Algorithms Beyond the Worst Case.
• Demiriz et al. (1999) Demiriz, A., Bennett, K. P., and Embrechts, M. J. (1999). Semi-supervised clustering using genetic algorithms. In Artificial Neural Networks in Engineering (ANNIE-99), pages 809–814.
• Kulis et al. (2009) Kulis, B., Basu, S., Dhillon, I., and Mooney, R. (2009). Semi-supervised graph clustering: a kernel approach. Machine learning, 74(1):1–22.
• Law et al. (2005) Law, M. H., Topchy, A. P., and Jain, A. K. (2005). Model-based clustering with probabilistic constraints. In SDM. SIAM.
• Maurer and Pontil (2010) Maurer, A. and Pontil, M. (2010). k-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846.
• Pollard (1984) Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.
• Valiant (1984) Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.
• Von Luxburg and Ben-David (2005) Von Luxburg, U. and Ben-David, S. (2005). Towards a statistical theory of clustering. In Pascal workshop on statistics and optimization of clustering, pages 20–26.
• Wagstaff et al. (2001) Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al. (2001). Constrained k-means clustering with background knowledge. In ICML, volume 1, pages 577–584.
• Xing et al. (2002) Xing, E. P., Jordan, M. I., Russell, S., and Ng, A. Y. (2002). Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems, pages 505–512.