When is Clustering Perturbation Robust?

01/22/2016
by   Margareta Ackerman, et al.

Clustering is a fundamental data mining tool that aims to divide data into groups of similar items. Generally, intuition about clustering reflects the ideal case -- exact data sets endowed with flawless dissimilarity between individual instances. In practice however, these cases are in the minority, and clustering applications are typically characterized by noisy data sets with approximate pairwise dissimilarities. As such, the efficacy of clustering methods in practical applications necessitates robustness to perturbations. In this paper, we perform a formal analysis of perturbation robustness, revealing that the extent to which algorithms can exhibit this desirable characteristic is inherently limited, and identifying the types of structures that allow popular clustering paradigms to discover meaningful clusters in spite of faulty data.


1 Introduction

Clustering is a popular data mining tool, due in no small part to its general and intuitive goal of dividing data into groups of similar items. Yet in spite of this seemingly simple task, successful application of clustering techniques in practice is oftentimes challenging. In particular, there are inherent difficulties in the data collection process and design of pairwise dissimilarity measures, both of which significantly impact the behavior of clustering algorithms.

Intuition about clustering often reflects the ideal case – flawless data sets with well-suited dissimilarity between individual instances. In practice, however, these cases are rare. Errors are introduced into a data set for a wide variety of reasons, from the limited precision of instruments (a student's ruler and the Large Hadron Collider alike have a set precision) to human error when data is user-reported (common in the social sciences). Additionally, the dissimilarity between pairs of instances is often based on heuristic measures, particularly when non-numeric attributes are present. Furthermore, the dynamic nature of prominent clustering applications (such as personalization for recommendation systems) implies that by the time the data has been clustered, it has already changed.

The ubiquity of flawed input poses a serious challenge. If clustering is to operate strictly under the assumption of ideal data, its applicability would be reduced to fairly rare applications where such data can be attained. As such, it would be desirable for clustering algorithms to provide some qualitative guarantees about their output when partitioning noisy data. This leads us to explore whether there are any algorithms for which such guarantees can be provided.

Although data can be faulty in a variety of ways, our focus here is on inaccuracies of pairwise distances. At a minimum, small perturbation to data should not radically affect the output of an algorithm. It would be natural to expect that some clustering techniques are more robust than others, allowing users to rely on perturbation robust techniques when pairwise distances are inexact.

However, our investigation reveals that no reasonable clustering algorithm exhibits this desirable characteristic. In fact, both additive and multiplicative perturbation robustness are unrealistic requirements. We show that no clustering algorithm can satisfy robustness to perturbation without violating even more fundamental requirements. Not only do existing methods lack this desirable characteristic, but our findings also preclude the possibility of designing novel perturbation robust clustering methods.

Perhaps it is already surprising that no reasonable clustering algorithm can be perfectly perturbation robust, but our results go further. Instead of requiring that the clustering remain unchanged following a perturbation, we allow up to two-thirds of all pairwise distances to change status (from in-cluster to between-cluster, or vice versa). It turns out that even this substantial relaxation does not overcome our impossibility theorem.

Luckily, further exploration paints a more optimistic picture. A careful examination of this issue requires a look back to the underlying goal of clustering, which is to discover clustering structure in data when such structure is present. Our investigation suggests that sensitivity to small perturbations is inevitable only on unclusterable instances, for which clustering is inherently ill-suited. As such, it can be argued that whether an algorithm exhibits robustness on such data is inconsequential.

On the other hand, we show that when data is endowed with inherent structure, existing methods can often successfully reveal that structure even on faulty (perturbed) data. We investigate the type of cluster structures required for the success of popular clustering techniques, showing that the robustness of $k$-means and related methods is directly proportional to the degree of inherent cluster structure. Similarly, we show that popular linkage-based techniques are robust when clusters are well-separated. Furthermore, different cluster structures are necessary for different algorithms to exhibit robustness to perturbations.

1.1 Previous work

This work follows a line of research on theoretical foundations of clustering. Efforts in the field began as early as the 1970s with the pioneering work of Wright Wright (1973) on axioms of clustering, as well as analysis of clustering properties by Fisher and Van Ness (1971) and Jardine and Sibson (1971), among others. The field saw a renewed surge of activity following Kleinberg's famous impossibility theorem Kleinberg (2003), which showed that no clustering function can simultaneously satisfy three simple properties. Also related to our work is a framework for selecting clustering methods based on differences in their input-output behavior Ackerman et al. (2010b, 2012); Jardine and Sibson (1971); Zadeh and Ben-David (2009); Ackerman et al. (2010a, 2013), as well as research on clusterability, which aims to quantify the degree of inherent cluster structure in data Ben-David (2015); Ackerman and Ben-David (2009); Balcan et al. (2008a, b); Ostrovsky et al. (2006).

Previous work on perturbation robustness studies it from a computational perspective by identifying new efficient algorithms for robust instances Bilu and Linial (2010); Ackerman and Ben-David (2009); Awasthi et al. (2012). Ben-David and Reyzin Ben-David and Reyzin (2014) recently studied corresponding NP-hardness lower bounds.

In this paper, we take a fresh look at perturbation robustness. We begin our investigation by asking when perturbation robustness is possible to attain. After proving that robustness to perturbations cannot be achieved as a data-independent property of an algorithm, we seek to understand when popular clustering paradigms satisfy this requirement. Our analysis of established methods is an essential complement to efforts in algorithmic development, as the need for understanding established methods is amplified by the fact that most clustering users rely on a small number of well-known techniques. Our results demonstrate the type of cluster structures required for robustness of popular clustering paradigms.

2 Definitions and notation

Clustering is a wide and heterogeneous domain. For most of this paper, we focus on a basic sub-domain where the input to a clustering function is a finite set of points endowed with a between-points dissimilarity function together with the number of clusters ($k$), and the output is a partition of that domain.

A dissimilarity function over a domain $X$ is a symmetric function $d : X \times X \to \mathbb{R}$ such that $d(x, y) \geq 0$ for all $x, y \in X$. The data sets that we consider are pairs $(X, d)$, where $X$ is some finite domain set and $d$ is a dissimilarity function over $X$.

A $k$-clustering $C = \{C_1, \dots, C_k\}$ of a data set $(X, d)$ is a partition of $X$ into $k$ disjoint non-empty subsets (or, clusters) of $X$ (so, $\bigcup_i C_i = X$). A clustering of $X$ is a $k$-clustering of $X$ for some $k \geq 1$.

For a clustering $C$, let $|C|$ denote the number of clusters in $C$ and $|C_i|$ denote the number of points in a cluster $C_i$. For a domain $X$, $|X|$ denotes the number of points in $X$, which we denote by $n$ when the domain is clear from context. We write $x \sim_C y$ if $x$ and $y$ are both in some cluster $C_i$, and $x \nsim_C y$ otherwise. This is an equivalence relation.

The Hamming distance between clusterings $C$ and $C'$ of the same domain set $X$ is defined by

$$\Delta(C, C') = \frac{\bigl|\{\{x, y\} \subseteq X : (x \sim_C y) \oplus (x \sim_{C'} y)\}\bigr|}{\binom{|X|}{2}},$$

where $\oplus$ denotes the logical XOR operation.

That is, the distance is the fraction of pairs (edges) that disagree, being in-cluster in one of the clusterings and between-cluster in the other. The maximum distance between clusterings is attained when the Hamming distance is $1$.
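For concreteness, here is a small Python sketch of this pairwise-disagreement distance, normalized by the total number of pairs as above; clusterings are represented as lists of cluster labels, one label per point. The function name is ours, chosen for illustration.

    from itertools import combinations

    def hamming_clustering_distance(labels_a, labels_b):
        """Fraction of point pairs that are in-cluster in one clustering and
        between-cluster in the other (the distance defined above)."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        disagreements = 0
        for i, j in combinations(range(n), 2):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            if same_a != same_b:  # logical XOR of the two in-cluster relations
                disagreements += 1
        return disagreements / (n * (n - 1) / 2)

    # Example: the two clusterings disagree on 2 of the 3 pairs.
    print(hamming_clustering_distance([0, 0, 1], [0, 1, 1]))  # 0.666...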

Lastly, we formally define clustering functions.

Definition 1 (Clustering function).

A clustering function $F$ is a function that takes as input a pair $(X, d)$ and a parameter $k$, and outputs a $k$-clustering of the domain $X$.

3 Perturbation robustness as a property of an algorithm

Whenever a user is faced with the task of clustering faulty data, it would be natural to select an algorithm that is robust to perturbations of pairwise dissimilarities. As such, we begin our study of perturbation robustness by casting it as a property of an algorithm. If we could classify algorithms based on whether or not (or to what degree) they are perturbation robust, then clustering users could incorporate this information when making decisions regarding which algorithms to apply on their data.

First, we define what it means to perturb a dissimilarity function.

Definition 2 ($\varepsilon$-multiplicative perturbation of a dissimilarity function).

Given a pair of dissimilarity functions $d$ and $d'$ over a domain $X$, $d'$ is an $\varepsilon$-multiplicative perturbation of $d$, for $\varepsilon \geq 1$, if for all $x, y \in X$, $\frac{1}{\varepsilon}\, d(x, y) \leq d'(x, y) \leq \varepsilon\, d(x, y)$.

Additive perturbation of a dissimilarity function is defined analogously.

Definition 3 ($\varepsilon$-additive perturbation of a dissimilarity function).

Given a pair of dissimilarity functions $d$ and $d'$ over a domain $X$, $d'$ is an $\varepsilon$-additive perturbation of $d$, for $\varepsilon > 0$, if for all $x, y \in X$, $|d(x, y) - d'(x, y)| \leq \varepsilon$.
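As an illustration, the following sketch produces random perturbations of a dissimilarity matrix under both notions. The concrete bounds used (scaling each entry by a factor in $[1/\varepsilon, \varepsilon]$, or shifting it by at most $\varepsilon$) follow the reconstructed Definitions 2 and 3 above and should be read as an interpretation, not as the paper's code.

    import numpy as np

    def multiplicative_perturbation(D, eps, rng=None):
        """Scale every pairwise dissimilarity by a random factor in [1/eps, eps],
        keeping the matrix symmetric with a zero diagonal."""
        rng = np.random.default_rng(rng)
        n = D.shape[0]
        factors = np.triu(rng.uniform(1.0 / eps, eps, size=(n, n)), 1)
        factors = factors + factors.T
        Dp = D * factors
        np.fill_diagonal(Dp, 0.0)
        return Dp

    def additive_perturbation(D, eps, rng=None):
        """Shift every pairwise dissimilarity by at most eps (clipped at zero)."""
        rng = np.random.default_rng(rng)
        n = D.shape[0]
        shifts = np.triu(rng.uniform(-eps, eps, size=(n, n)), 1)
        shifts = shifts + shifts.T
        Dp = np.clip(D + shifts, 0.0, None)
        np.fill_diagonal(Dp, 0.0)
        return Dp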

It is important to note that all of our results hold for both multiplicative and additive perturbation robustness. Perturbation robust algorithms should be invariant to data perturbations; that is, if the data is perturbed, then the output of the algorithm should not change. This view of perturbation robustness is not only intuitive, but is also based on previous formulations Reyzin (2012); Bilu and Linial (2010); Awasthi et al. (2012) (it can be formalized as a property of clustering functions by setting $\delta = 0$ in Definition 4 below).

However, requiring that the partitioning be identical before and after perturbation is provably too strict a requirement for clustering algorithms, as it can only hold for functions that effectively ignore all pairwise distances. That is, this notion of perturbation robustness only holds for clustering functions that, given any domain set $X$ and integer $k$, produce the same partitioning regardless of the setting of $d$ (see Section 6 in the appendix for details).

As such, we introduce a relaxation that allows some error in the output of the algorithm on perturbed data. From a practical point of view, a user who has only a perturbation of the true data set is likely to be satisfied with an approximately correct solution.

Definition 4.

A clustering function $F$ is $(\varepsilon, \delta)$-multiplicative perturbation robust if, given any data set $(X, d)$ and any $k$, whenever $d'$ is an $\varepsilon$-multiplicative perturbation of $d$,

$$\Delta\bigl(F((X, d), k),\; F((X, d'), k)\bigr) \leq \delta.$$

Additive perturbation robustness is defined analogously, by replacing the $\varepsilon$-multiplicative perturbation of the dissimilarity function with an $\varepsilon$-additive perturbation.
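The definition suggests a simple empirical check for any clustering routine that accepts a precomputed dissimilarity matrix: cluster the original data, cluster several perturbations of it, and compare the outputs with the Hamming distance of Section 2. The sketch below reuses the illustrative helpers defined earlier and only provides evidence against robustness; a small observed value does not certify it.

    def empirical_robustness(cluster_fn, D, k, eps, trials=50, multiplicative=True):
        """Largest observed Hamming distance between the clustering of D and the
        clusterings of random eps-perturbations of D."""
        perturb = multiplicative_perturbation if multiplicative else additive_perturbation
        base = cluster_fn(D, k)
        worst = 0.0
        for seed in range(trials):
            Dp = perturb(D, eps, rng=seed)
            worst = max(worst, hamming_clustering_distance(base, cluster_fn(Dp, k)))
        return worst  # the function can be (eps, delta)-robust on D only if this stays <= delta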

Figure 1: An illustration of the three-body rule. Based on the main objective of clustering, which is to group similar items together, this rule requires that whenever an algorithm is given exactly three points and the number of clusters is two, then it groups the two closest elements.

Despite this substantial relaxation, perturbation robustness for $\delta$ as high as $2/3$ is inherently incompatible with even more elementary requirements, as shown below.

3.1 Impossibility theorem for clustering functions

We now proceed to show that perturbation robustness is too strong a requirement for clustering algorithms, and as such neither existing nor novel techniques can have this desirable characteristic.

Particularly notable is that the impossibility result persists when $\delta$ is as high as $2/3$, meaning that a perturbation is allowed to change up to two-thirds of all pairwise distances from in-cluster to between-cluster, or vice versa. As such, we show that no reasonable clustering algorithm can guarantee to preserve more than a third of its pairwise relations after a perturbation.

The following impossibility result derives from the pioneering work of Wright Wright (1973) on axioms of clustering. Wright originally proposed his axioms in Euclidean space; here we generalize them to arbitrary pairwise dissimilarities.

The first axiom we discuss follows from Wright’s 11th axiom, and captures the very essence of clustering: to group similar items. This property considers an elementary scenario, requiring that given exactly three points, an algorithm asked for two clusters should group the two closest elements. See Figure 1 for an illustration. A special case of this rule occurs when the three elements lie on the real line, in which case the furthest endpoint should be placed in its own cluster.

Definition 5 (Three-body rule).

Given a data set $(\{x, y, z\}, d)$, if $d(x, y) < d(x, z)$ and $d(x, y) < d(y, z)$, then $F((\{x, y, z\}, d), 2) = \{\{x, y\}, \{z\}\}$.

Wright’s 6th axiom, and the final one we consider here, requires that replicating all data points the same number of times should not change the clustering output. Outside of Euclidean space, we replicate a point $x$ by adding a new element $x'$ and setting $d(x', y) = d(x, y)$ for all $y \in X$.

Definition 6 (Replication invariance).

Given any positive integer $m$, if all points are replicated $m$ times, then the partitioning of the original data is unchanged and all replicas lie in the same cluster as their original element.
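A sketch of the replication operation on a dissimilarity matrix, following the description above: each replica is at distance 0 from the other copies of its original point and inherits its original's distances to everything else.

    import numpy as np

    def replicate(D, m):
        """Replicate every point m times: copies of point i sit at indices
        i, i + n, i + 2n, ..., are mutually at distance 0, and are at distance
        D[i, j] from every copy of point j."""
        n = D.shape[0]
        big = np.tile(D, (m + 1, m + 1))  # (m+1) x (m+1) blocks, each a copy of D
        for a in range(m + 1):
            for b in range(m + 1):
                # distance 0 between copies of the same original point
                np.fill_diagonal(big[a * n:(a + 1) * n, b * n:(b + 1) * n], 0.0)
        return big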

Not only are these two axioms natural, as violating them leads to counterintuitive behavior, but they also hold for common techniques. It is easy to show that they are satisfied by common clustering paradigms, including cost-based methods such as $k$-means, $k$-median, and $k$-medoids, as well as linkage-based techniques such as single-linkage, average-linkage, and complete-linkage.

We now prove that no clustering function that satisfies the three-body rule and replication invariance can be perturbation robust. Furthermore, our result holds for all values of $\varepsilon$. Note that the following result applies to arbitrarily large data sets, for both multiplicative and additive perturbations.

Theorem 1.

For any $\varepsilon > 1$, $\varepsilon' > 0$, and $\delta < 2/3$, there is no clustering function that satisfies

  1. $(\varepsilon, \delta)$-multiplicative perturbation robustness, replication invariance, and the three-body rule, and

  2. $(\varepsilon', \delta)$-additive perturbation robustness, replication invariance, and the three-body rule.

    Further, the result holds for arbitrarily large data.

Proof.

We proceed by contradiction, assuming that there exists a clustering function $F$ that is replication invariant, adheres to the three-body rule, and is $(\varepsilon, \delta)$-multiplicative perturbation robust for some $\varepsilon > 1$ and $\delta < 2/3$.

Consider a data set $X = \{x, y, z\}$ with a distance function $d$ under which $\{x, y\}$ is the unique closest pair, chosen so that an $\varepsilon$-multiplicative perturbation can make $\{y, z\}$ the unique closest pair. By the three-body rule, $F((X, d), 2) = \{\{x, y\}, \{z\}\}$. We now replicate each point an arbitrary number of times, $m$, creating three sets $A$, $B$, $C$ such that all points that are replicas of the point $x$ and $x$ itself belong to $A$, all points that are replicas of the point $y$ and $y$ itself belong to $B$, and similarly for $C$. By replication invariance, $F$ clusters the replicated data as $\{A \cup B, C\}$.

Next, we apply an $\varepsilon$-multiplicative perturbation, creating a distance function $d'$ under which $\{y, z\}$ is the unique closest pair. By the three-body rule and replication invariance, $F$ clusters the perturbed data as $\{A, B \cup C\}$, and yet $(\varepsilon, \delta)$-multiplicative perturbation robustness requires that the Hamming distance between the two outputs be at most $\delta$. But the Hamming distance between $\{A \cup B, C\}$ and $\{A, B \cup C\}$ exceeds this bound, and we reach a contradiction.

For additive perturbation, choose $d$ so that $\{x, y\}$ is the unique closest pair and an $\varepsilon'$-additive perturbation can make $\{y, z\}$ the unique closest pair. By the three-body rule, $F((X, d), 2) = \{\{x, y\}, \{z\}\}$. As in the multiplicative case, we replicate each point $m$ times, creating three sets $A$, $B$, $C$. By replication invariance, the replicated data is clustered as $\{A \cup B, C\}$. We apply an $\varepsilon'$-additive perturbation to obtain a distance function $d'$ under which $\{y, z\}$ is the unique closest pair. By the three-body rule and replication invariance, the perturbed data is clustered as $\{A, B \cup C\}$, and yet $(\varepsilon', \delta)$-additive perturbation robustness requires that the Hamming distance between the two outputs be at most $\delta$, reaching a contradiction. ∎

Note that the above result also holds when the data lies in Euclidean space. This allows us to view perturbations as small movements in space, required to satisfy certain constraints such as the triangle inequality while adhering to the dissimilarity constraints required by Definitions 2 and 3. See the supplementary material for details.
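A tiny numeric sketch of the unreplicated three-point scenario, with illustrative distances that are not taken from the paper: a modest multiplicative perturbation changes which pair is closest, so any function obeying the three-body rule must switch from $\{\{a, b\}, \{c\}\}$ to $\{\{b, c\}, \{a\}\}$, and the two outputs disagree on two of the three pairs.

    # Illustrative distances only; a multiplicative perturbation with eps = 1.2
    # suffices to flip the closest pair.
    d  = {("a", "b"): 1.0, ("b", "c"): 1.1, ("a", "c"): 1.3}
    dp = {("a", "b"): 1.2, ("b", "c"): 1.1, ("a", "c"): 1.3}  # scale d(a, b) by 1.2

    print(min(d, key=d.get))    # ('a', 'b'): three-body rule forces {{a, b}, {c}}
    print(min(dp, key=dp.get))  # ('b', 'c'): three-body rule forces {{b, c}, {a}}

    # The pairs (a, b) and (b, c) switch between in-cluster and between-cluster,
    # while (a, c) stays between-cluster: 2 of the 3 pairs disagree.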

4 Perturbation robustness as a property of data

The above section demonstrates an inherent limitation of perturbation robustness as a property of clustering algorithms, showing that no reasonable clustering algorithm can exhibit this desirable characteristic. However, it turns out that perturbation robustness is possible to achieve when we restrict our attention to data endowed with inherent structure.

As such, perturbation robustness becomes a property of both an algorithm and a specific data set. We introduce a definition of perturbation robustness that directly addresses the underlying data.

Definition 7 ($(\varepsilon, \delta)$-multiplicative perturbation robustness of data).

A data set $(X, d)$ satisfies $(\varepsilon, \delta)$-multiplicative perturbation robustness with respect to a clustering function $F$ and $k$, if for any $d'$ that is an $\varepsilon$-multiplicative perturbation of $d$,

$$\Delta\bigl(F((X, d), k),\; F((X, d'), k)\bigr) \leq \delta.$$

Additive perturbation robustness of data is defined analogously.

This perspective on perturbation robustness raises a natural question: On what types of data are algorithms perturbation robust? Next, we explore the type of structures that allow popular cost-based paradigms and linkage-based methods to uncover meaningful clusters even when data is faulty.

4.1 Perturbation robustness of $k$-means, $k$-medoids, and min-sum

We begin our study of data-dependent perturbation robustness by considering the cluster structures required for perturbation robustness of some of the most popular clustering functions: $k$-means, $k$-medoids, and min-sum.

Recall that $k$-means Steinley (2006) finds the clustering $C = \{C_1, \dots, C_k\}$ that minimizes $\sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)^2$, where $c_i$ is the center of mass of cluster $C_i$. An equivalent formulation that does not rely on centers of mass appears in Ostrovsky et al. (2006). A closely related clustering function is $k$-medoids, where centers are required to be part of the data; formally, the $k$-medoids cost of $C$ is $\sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i)^2$, where each $m_i \in C_i$ is chosen to minimize the objective. Lastly, the min-sum Sahni and Gonzalez (1976) clustering function minimizes the sum of all in-cluster distances, $\sum_{i=1}^{k} \sum_{x, y \in C_i} d(x, y)$.
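The sketch below evaluates the three costs directly from a dissimilarity matrix and a label vector, using the pairwise (center-free) form of the $k$-means objective mentioned above; writing the $k$-medoids cost with squared dissimilarities is our assumption, made for parallelism with $k$-means.

    import numpy as np

    def kmeans_cost(D, labels):
        """Pairwise (center-free) form of the k-means objective."""
        labels = np.asarray(labels)
        cost = 0.0
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            block = D[np.ix_(idx, idx)]
            cost += (block ** 2).sum() / (2 * len(idx))  # ordered pairs, hence the 2
        return cost

    def kmedoids_cost(D, labels):
        """Each cluster is charged to its best in-cluster center (medoid)."""
        labels = np.asarray(labels)
        cost = 0.0
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            block = D[np.ix_(idx, idx)]
            cost += (block ** 2).sum(axis=1).min()
        return cost

    def minsum_cost(D, labels):
        """Sum of all in-cluster dissimilarities."""
        labels = np.asarray(labels)
        cost = 0.0
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            cost += D[np.ix_(idx, idx)].sum() / 2  # undirected pairs
        return cost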

Many different notions of clusterability have been proposed in prior work Ackerman and Ben-David (2009); Ben-David (2015). Although they all aim to quantify the same tendency, it has been proven that notions of clusterability are often pairwise inconsistent Ackerman and Ben-David (2009). As such, care must be taken when selecting amongst them.

In order to analyze -means and related functions, we turn our attention to an intuitive cost-based notion, which requires that clusterings of near-optimal cost be structurally similar to the optimal solution. That is, this notion characterizes clusterable data as that which has a unique optimal solution in a strong sense, by excluding the possibility of having radically different clusterings of similar cost. See Figure 2 for an illustration.

This property, called “uniqueness of optimum” (a notion of clusterability that has appeared under several different names; the term “uniqueness of optimum” was coined by Ben-David (2015)), and closely related variations were investigated by Balcan et al. (2008a), Ostrovsky et al. (2006), Agarwal et al. (2013), and Ackerman et al. (2013), among others. See Balcan et al. (2008a) for a detailed exposition.

Definition 8 (Uniqueness of optimum).

Given a clustering function $F$ with cost function $\mathrm{cost}$, a data set $(X, d)$ is $(c, \delta)$-uniquely optimal if for every $k$-clustering $C$ of $X$ where $\mathrm{cost}_d(C) \leq c \cdot \mathrm{cost}_d(F((X, d), k))$, we have $\Delta\bigl(C, F((X, d), k)\bigr) \leq \delta$.

We show that whenever data satisfies the uniqueness of optimum notion of clusterability, $k$-means, $k$-medoids, and min-sum are perturbation robust. Furthermore, the degree of robustness depends on the extent to which the data is clusterable.
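On very small data sets, uniqueness of optimum can be checked by brute force: enumerate every $k$-clustering and verify that any clustering whose cost is within the given factor of the optimum is also within Hamming distance $\delta$ of it. The sketch below reuses the illustrative cost and distance helpers defined earlier.

    from itertools import product
    import numpy as np

    def is_uniquely_optimal(D, k, cost_fn, c, delta):
        """Brute-force check of (c, delta)-uniqueness of optimum on a tiny data set."""
        n = D.shape[0]
        labelings = [np.array(lab) for lab in product(range(k), repeat=n)
                     if len(set(lab)) == k]  # exactly k non-empty clusters
        costs = [cost_fn(D, lab) for lab in labelings]
        best = int(np.argmin(costs))
        opt_labels, opt_cost = labelings[best], costs[best]
        for lab, cost in zip(labelings, costs):
            if cost <= c * opt_cost and hamming_clustering_distance(opt_labels, lab) > delta:
                return False
        return True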

Figure 2: An illustration of the uniqueness of optimum notion of clusterability for two clusters. Consider $k$-means, $k$-medoids, or min-sum. The highly-clusterable data depicted in (a) has a unique optimal solution, with no structurally different clusterings of near-optimal cost. In contrast, (b) displays data with two radically different clusterings of near-optimal cost, making this data poorly clusterable for $k = 2$.

For the following proofs we will use $\mathrm{cost}_d(C)$ to denote the cost of clustering $C$ with the distance function $d$. We now show the relationship between uniqueness of optimum and perturbation robustness for $k$-means.

Theorem 2.

Consider the $k$-means clustering function and a data set $(X, d)$. If $(X, d)$ is $(c, \delta)$-uniquely optimal, then it is also $(\varepsilon, \delta)$-additive perturbation robust for all $\varepsilon$ below a bound determined by $c$ and the data.

Proof.

Consider a data set $(X, d)$, and let $d'$ be any $\varepsilon$-additive perturbation of $d$.

First, we argue that is close to . Let . First, note that . This is because finds the optimal solution on , and so the clustering it selects can only have lower or equal to cost than the cost of when evaluated with .

So, we calculate the -means cost of on . The -means objective function is equivalent to Ostrovsky et al. (2006). After an additive perturbation, any pairwise distance, , is bounded by . In addition, the contribution of any in-cluster pairwise distance to the total cost of the clustering is proportional to the magnitude of the distance. It therefore follows

(1a)
(1b)
(1c)
(1d)

By distributing the summation in the inequality 1d we come to:

(2a)
(2b)
(2c)

The first term, 2a, is equivalent to . We deal with the second term, 2b, by defining two sets and . To define , we first define . . Then . Similarly , and .

(3a)
(3b)

Because for all for all , , we can square the value in term 3a while only increasing the total value. Likewise, we can replace the value in term 3b with 1 while only increasing the total value.

(4a)
(4b)

Since and both consist of point pairs in and we are looking for an upper bound:

(5a)
(5b)

Note that is equivalent to . We can now return to the original inequality.

(6a)
(6b)
(6c)
(6d)

While we cannot know the size of individual clusters in the general case we do know is upper bounded by 1. Therefore we can substitute with 1 in terms 6c and 6d while only increasing the value. For the same reasons we do not know the number of in-cluster point pairs in the general case in terms 6c and 6d. However we do know the number of in-cluster point pairs is bounded by the total number of point pairs, namely which can be substituted in the same way while only increasing the value.

(7a)
(7b)
(7c)
(7d)

Therefore we know:

Then, , so . Similarly, , so where . So, . ∎

Theorem 3.

Consider the $k$-means clustering function and a data set $(X, d)$. If $(X, d)$ is $(c, \delta)$-uniquely optimal, then it is also $(\varepsilon, \delta)$-multiplicative perturbation robust for all $\varepsilon$ below a bound determined by $c$.

Proof.

Consider a data set $(X, d)$, and let $d'$ be any $\varepsilon$-multiplicative perturbation of $d$.

First, we argue that is close to . Let . First, note that . This is because finds the optimal solution on , and so the clustering it selects can only have lower cost than the cost of when evaluated with .

So, we calculate the cost of on . The -means cost function is bounded by After a multiplicative perturbation, the contribution of an edge of length , which used to contribute at most to the cost of the function, contributes at most , and so the contribution increases by at most a factor of . .

Then, , so . ∎
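Under the assumption that an $\varepsilon$-multiplicative perturbation scales each dissimilarity by a factor in $[1/\varepsilon, \varepsilon]$ (as in the reconstructed Definition 2), one consistent way to carry out the cost comparison in the proof above is the following chain, written with $C = F((X, d'), k)$ and $C^* = F((X, d), k)$:

$$\mathrm{cost}_d(C) \;\leq\; \varepsilon^2\, \mathrm{cost}_{d'}(C) \;\leq\; \varepsilon^2\, \mathrm{cost}_{d'}(C^*) \;\leq\; \varepsilon^4\, \mathrm{cost}_d(C^*).$$

Here the first and last inequalities use the fact that each squared in-cluster distance, and hence the $k$-means cost of any fixed clustering, changes by at most a factor of $\varepsilon^2$, and the middle inequality uses the optimality of $C$ under $d'$. It follows that if $(X, d)$ is $(\varepsilon^4, \delta)$-uniquely optimal, then $\Delta(C, C^*) \leq \delta$. This is a sketch of one consistent choice of constants, not necessarily the exact bound intended in the theorem statement.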

The proofs for $k$-medoids and min-sum follow similarly and are included in the appendix.

4.2 Perturbation robustness of Linkage-Based algorithms

We now move on to Linkage-Based algorithms, which, in contrast to the methods studied in the previous section, do not seek to optimize an explicit objective function. Instead, they perform a series of merges, combining clusters according to their own measure of between-cluster distance.

Given clusters $X_1$ and $X_2$, the following are the between-cluster distances used by some of the most popular Linkage-Based algorithms:

  • Single linkage: $\min_{x \in X_1,\, y \in X_2} d(x, y)$

  • Average linkage: $\frac{1}{|X_1|\,|X_2|} \sum_{x \in X_1,\, y \in X_2} d(x, y)$

  • Complete linkage: $\max_{x \in X_1,\, y \in X_2} d(x, y)$

We consider Linkage-Based algorithms with the $k$-stopping criterion, which terminates the algorithm when $k$ clusters remain and returns the resulting partitioning.
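A compact sketch of this family of algorithms operating directly on a dissimilarity matrix, parameterized by the between-cluster distance and stopped when $k$ clusters remain (a naive implementation for illustration only):

    import numpy as np

    def linkage_clustering(D, k, linkage="single"):
        """Naive agglomerative clustering with the k-stopping criterion."""
        between = {
            "single":   lambda block: block.min(),
            "average":  lambda block: block.mean(),
            "complete": lambda block: block.max(),
        }[linkage]
        clusters = [[i] for i in range(D.shape[0])]  # start from singletons
        while len(clusters) > k:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    dist = between(D[np.ix_(clusters[a], clusters[b])])
                    if best is None or dist < best[0]:
                        best = (dist, a, b)
            _, a, b = best
            clusters[a] = clusters[a] + clusters[b]  # merge the closest pair of clusters
            del clusters[b]
        labels = np.empty(D.shape[0], dtype=int)
        for c, members in enumerate(clusters):
            labels[members] = c
        return labels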

Because no explicit objective functions are used, we cannot rely on the uniqueness of optimum notion of clusterability. To define the type of cluster structure on which Linkage-Based algorithms exhibit perturbation robustness, we introduce a natural measure of clusterability based on a definition by Balcan et al Balcan et al. (2008b). The original notion required data to contain a clustering where every element is closer to all elements in its cluster than to all other points. This notion was also used in  Ackerman et al. (2012), Reyzin (2012), and Ackerman and Dasgupta (2014). See Figure 4 for an illustration.

Definition 9 ($\alpha$-strictly separable).

A data set $(X, d)$ is $\alpha$-strictly separable if there exists a unique clustering $C$ of $X$ so that for all $x \sim_C y$ and all $z$ with $x \nsim_C z$, $\alpha \cdot d(x, y) < d(x, z)$.

The definition of $\alpha$-strictly additive separable is analogous.

Definition 10 ($\alpha$-Strictly Additive Separable).

A data set $(X, d)$ is $\alpha$-Strictly Additive Separable if there exists a unique clustering $C$ of $X$ so that for all $x \sim_C y$ and all $z$ with $x \nsim_C z$, $d(x, y) + \alpha < d(x, z)$.
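Both notions can be checked directly against a candidate clustering, as in the sketch below; the inequalities follow the definitions as stated above.

    import numpy as np

    def is_strictly_separable(D, labels, alpha, additive=False):
        """Check that, for every point, every between-cluster dissimilarity exceeds
        every in-cluster dissimilarity by a factor of alpha (multiplicative) or by
        a margin of alpha (additive)."""
        labels = np.asarray(labels)
        n = D.shape[0]
        for x in range(n):
            same = np.flatnonzero((labels == labels[x]) & (np.arange(n) != x))
            other = np.flatnonzero(labels != labels[x])
            if len(same) == 0 or len(other) == 0:
                continue
            max_in = D[x, same].max()
            min_between = D[x, other].min()
            ok = min_between > max_in + alpha if additive else min_between > alpha * max_in
            if not ok:
                return False
        return True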

Before moving on to our results for Linkage-Based algorithms, we show that the above notions of clusterability are not sufficient to guarantee that data is perturbation robust for $k$-means and similar methods. This indicates that different algorithms require different cluster structures in order to exhibit perturbation robustness. We show this result for $\alpha$-strictly separable data. The proof for $\alpha$-strictly additive separable data is in the supplementary material.

Figure 3: An illustration of the data set in the proof of Theorem 4. The original data consists of a dense region (a “cloud” of points) and outliers. Before the perturbation, the cloud is split evenly into clusters, whereas after the perturbation it is partitioned into different groups, leading to a large Hamming distance.
Theorem 4.

Let $F$ be any one of $k$-means, $k$-medoids, or min-sum. Then for any $\alpha$ and any $\varepsilon > 1$, there exists an $\alpha$-strictly separable data set on which $F$ is not $\varepsilon$-multiplicative perturbation robust.

Proof.

We construct a data set in which one cloud of points is densely packed together and the singleton points lie far away from all other points. The strictly separable clustering of this data consists of the singleton points forming separate clusters and the cloud forming one cluster.

Arrange the cloud of points such that the ratio of the largest to smallest in-cloud distance is less than $\varepsilon^2$, the ratio of the largest to smallest cloud-point-to-singleton-point distance is less than $\varepsilon^2$, and all points are positioned such that $F$ splits the cloud evenly into separate clusters and places the singleton points into separate clusters.

Because the ratio between the largest and smallest in-cloud distance is less than $\varepsilon^2$, an $\varepsilon$-multiplicative perturbation can radically change the structure of the points in the cloud. If a point is identified by its distances to all other data, then perturbing the distance function can cause points to switch roles with one another. This ability to change applies similarly to the distances between cloud and singleton points. Because points can be made to act like other points, we perturb the data set so that the maximum number of in/between-cluster relationships change (with the restriction of never switching a point from being in the cloud to being a singleton point or vice versa, because a multiplicative perturbation cannot necessarily make this switch).

We maximize the possible Hamming distance under the previous assumptions by constructing the following two clusterings: First, divide the points of the cloud evenly into clusters; this is our first clustering. Then, taking that clustering, re-cluster the points by subdividing each cluster into equal-size groups. Finally, form the new clusters by selecting one group from each previous cluster to be in each new cluster. See Figure 3 for an example.

To find the Hamming distance between these two clusterings we first find the number of pairwise relationships that were formerly between-cluster that are now in-cluster. First, remember that a group contains points and there will be group pairs that were formerly between-cluster that are now in cluster. Next, each point in a group will contribute to the Hamming distance per group pair. This gives the amount contributed to the Hamming distance by points relationships that were formerly between-cluster that are now in-cluster as .

We now find the number of pairwise dissimilarities that were formerly in-cluster that are now between-cluster. Similar to before, each group contains points and there will be groups that were formerly in-cluster that are now between-cluster. This gives the total Hamming distance as , which reduces to .

Figure 4: An example of strictly separable data. Note that it may include clusters with different diameters, as long as the dissimilarity between any two clusters scales as the larger diameter of the two.

We now show that whenever data is strictly separable, then it is also perturbation robust with respect to some of the most popular Linkage-Based algorithms. An analogous result for additive perturbation robustness appears in the supplementary material.

Theorem 5.

Single-Linkage, Average-Linkage, and Complete-Linkage are $\varepsilon$-multiplicative perturbation robust on all $\varepsilon^2$-strictly separable data sets.

Proof.

We begin by showing that whenever data is $1$-strictly separable, these Linkage-Based algorithms identify the underlying cluster structure. This result was previously shown for Single-Linkage Balcan et al. (2008b) and Average-Linkage Ackerman et al. (2012). We now prove it for Complete-Linkage.

First, we introduce the concept of a refinement. A clustering $C'$ is a refinement of clustering $C$ if $C$ can be obtained by merging clusters in $C'$. The proof proceeds by induction on the number of iterations, showing that at each step of the algorithm, the current clustering is a refinement of the strictly separable $k$-clustering $C$. Since Linkage-Based algorithms start by placing each point in its own cluster, the clustering formed in the first step is a refinement of $C$. Assuming that the hypothesis holds at some step of the algorithm, we show that it is retained in the following step. Consider any clusters $X_1$ and $X_2$ of the current clustering that are subsets of the same cluster in $C$, and any cluster $X_3$ that is a subset of a different cluster in $C$. Let $D = \max_{x \in X_1,\, y \in X_2} d(x, y)$. Then the dissimilarity between any point in $X_1$ and any point in $X_3$ is greater than $D$, since the data is $1$-strictly separable. Then Complete-Linkage merges $X_1$ with $X_2$ before merging $X_1$ with $X_3$.

Lastly, observe that data that is $\varepsilon^2$-strictly separable is also $1$-strictly separable, and remains $1$-strictly separable after an $\varepsilon$-multiplicative perturbation, as shown in Lemma 1 in the appendix. It follows that single, average, and complete linkage are $\varepsilon$-multiplicative perturbation robust on $\varepsilon^2$-strictly separable data. ∎
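Putting the pieces together, the following usage sketch builds well-separated one-dimensional data, perturbs it multiplicatively, and confirms that the linkage sketch above returns the same partition before and after the perturbation. The helpers are the illustrative ones defined earlier, not code from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    # Two tight groups on a line, far apart: a strictly separable data set.
    points = np.concatenate([rng.uniform(0, 1, 10), rng.uniform(100, 101, 10)])
    D = np.abs(points[:, None] - points[None, :])

    Dp = multiplicative_perturbation(D, eps=1.5, rng=1)

    for method in ("single", "average", "complete"):
        before = linkage_clustering(D, 2, method)
        after = linkage_clustering(Dp, 2, method)
        print(method, hamming_clustering_distance(before, after))  # expected: 0.0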

5 Conclusions

As a property of an algorithm, perturbation robustness fails in a strong sense, contradicting even more fundamental requirements of clustering functions. As such, no algorithm can exhibit this desirable characteristic on all data sets. Notably, this result persists even if we allow two-thirds of all pairwise distances to change following a perturbation.

However, a more optimistic picture emerges when considering clusterable data, and we show that popular paradigms are able to discover some cluster structures even on faulty data. Further, different clustering techniques are perturbation robust on different cluster structures. This has important implications for the “user’s dilemma,” which is the problem of selecting a suitable clustering algorithm for a given task. Faced with the challenge of clustering data with imprecise pairwise dissimilarities, a user cannot simply elect to apply a perturbation robust technique, as no such method exists; the selection of suitable methods therefore calls for some insight into the underlying structure of the data.

Future work will investigate robustness of heuristics, such as Lloyd’s method, for which preliminary analysis suggests that the cluster structure required for perturbation robustness depends on the method of initialization.

6 Impossibility theorems for $\delta = 0$

In prior work, perturbation robustness is often defined with $\delta = 0$, requiring that the clustering remain unchanged after a perturbation. Unfortunately, as a property of an algorithm, this formulation fails in a strong sense. That is, we show that no reasonable clustering function can satisfy this condition without effectively ignoring all pairwise distances, by outputting the same partitioning irrespective of the setting of $d$.

Specifically, we prove that both additive and multiplicative perturbation robustness with $\delta = 0$ contradict $2$-Richness, which is a relaxation of Kleinberg’s Richness axiom. This property requires clustering functions to be at least minimally responsive to pairwise dissimilarities. That is, given complete freedom to reassign all dissimilarities, we should be able to change the output of the function. This basic property is satisfied by all reasonable clustering methods Ackerman et al. (2010b).

For a domain set $X$ and parameter $k$, let $\mathrm{Range}(X, k)$ denote the set of all clusterings $C$ so that $C = F((X, d), k)$ for some dissimilarity function $d$.

Definition 11 (2-Richness).

For all domain sets $X$ and all $2 \leq k \leq |X| - 1$, $|\mathrm{Range}(X, k)| \geq 2$.

We now prove that both additive and multiplicative perturbation robustness (with $\delta = 0$) are inconsistent with $2$-Richness.

Theorem 6.

No clustering function is both $2$-Rich and $\varepsilon$-additive perturbation robust with $\delta = 0$, for any $\varepsilon > 0$.

Proof.

Let $F$ be any $2$-Rich clustering function. Then for any domain set $X$ and any $2 \leq k \leq |X| - 1$, there exist dissimilarity functions $d_1$ and $d_2$ so that $F((X, d_1), k) \neq F((X, d_2), k)$. Observe that we can transform $d_1$ into $d_2$ by making a finite sequence of incremental changes, each changing the dissimilarity function by no more than $\varepsilon$ on each pairwise dissimilarity. Then, applying $\varepsilon$-additive perturbation robustness (with $\delta = 0$) to each step, $F((X, d_1), k) = F((X, d_2), k)$, contradicting the previous claim. ∎

Now we prove the analogous result for multiplicative perturbation robustness.

Theorem 7.

No clustering function is both $2$-Rich and $\varepsilon$-multiplicative perturbation robust with $\delta = 0$, for any $\varepsilon > 1$.

Proof.

Let $F$ be any $2$-Rich, Isomorphism Invariant, and $\varepsilon$-multiplicative perturbation robust clustering function. Then for any domain set $X$ and any $2 \leq k \leq |X| - 1$, there exist dissimilarity functions $d_1$ and $d_2$ so that $F((X, d_1), k) \neq F((X, d_2), k)$. Observe that we can transform $d_1$ into $d_2$ through a series of $\varepsilon$-multiplicative perturbations, changing each pairwise distance by a factor of $\varepsilon$ or less with each perturbation.

A contradiction is therefore reached: applying $\varepsilon$-multiplicative perturbation robustness (with $\delta = 0$) to each step gives $F((X, d_1), k) = F((X, d_2), k)$, while $2$-Richness gives $F((X, d_1), k) \neq F((X, d_2), k)$. ∎

The above results imply that perturbation robustness with $\delta = 0$ is too stringent a requirement for clustering functions.

7 Proof of Lemma 1

Lemma 1.

A data set that is $\alpha$-strictly separable remains at least $\alpha/\varepsilon^2$-strictly separable after any $\varepsilon$-multiplicative perturbation.

Proof.

If a data set is $\alpha$-strictly separable with clustering $C$, then for all $x \sim_C y$ and all $z$ with $x \nsim_C z$, the between-cluster dissimilarity exceeds the in-cluster dissimilarity by a factor of $\alpha$: $\alpha \cdot d(x, y) < d(x, z)$.

An $\varepsilon$-multiplicative perturbation can shrink any pairwise distance by at most a factor of $\varepsilon$ and stretch any pairwise distance by at most a factor of $\varepsilon$, so the ratio between any between-cluster distance and any in-cluster distance decreases by at most a factor of $\varepsilon^2$.

Therefore, for the perturbed dissimilarity $d'$, $\frac{\alpha}{\varepsilon^2} \cdot d'(x, y) < d'(x, z)$ for all $x \sim_C y$ and all $z$ with $x \nsim_C z$. This is equivalent to the definition of $\alpha/\varepsilon^2$-strictly separable. ∎

8 Excluded multiplicative perturbation proofs

Theorem 8.

Consider the $k$-medoids clustering function and a data set $(X, d)$. If $(X, d)$ is $(c, \delta)$-uniquely optimal, then it is also $(\varepsilon, \delta)$-multiplicative perturbation robust for all $\varepsilon$ below a bound determined by $c$.

Proof.

Consider a data set $(X, d)$, and let $d'$ be any $\varepsilon$-multiplicative perturbation of $d$.

First, we argue that is close to . Let First, note that . This is because finds the optimal solution on , and so the clustering it selects can only have lower cost than the cost of . We also compute the cost of each element to the same cluster centers as in , as it provides an upper bound on . So, So, , so . ∎

Theorem 9.

Given the min-sum objective, a data set that is $(c, \delta)$-uniquely optimal is also $(\varepsilon, \delta)$-multiplicative perturbation robust for all $\varepsilon$ below a bound determined by $c$.

Proof.

Let be the optimal clustering of an arbitrary data set.

Let be any -multiplicative perturbation of . Therefore is at worst:

Because , . Therefore the maximum change an -multiplicative perturbation can produce approaches . ∎

9 Equivalent results for additive perturbation

We now prove the results in the main paper but for additive instead of multiplicative perturbation.

First, recall the definition of additive perturbation.

Definition 12 ($\varepsilon$-additive perturbation of a dissimilarity function).

Given a pair of dissimilarity functions $d$ and $d'$ over a domain $X$, $d'$ is an $\varepsilon$-additive perturbation of $d$, for $\varepsilon > 0$, if for all $x, y \in X$, $|d(x, y) - d'(x, y)| \leq \varepsilon$.

Definition 13 ($(\varepsilon, \delta)$-Additive Perturbation Robust Clustering Function).

A clustering function $F$ is $(\varepsilon, \delta)$-additive perturbation robust if, for any data set $(X, d)$ and any $k$, whenever $d'$ is an $\varepsilon$-additive perturbation of $d$,

$$\Delta\bigl(F((X, d), k),\; F((X, d'), k)\bigr) \leq \delta.$$

Theorem 10.

Consider the $k$-medoids clustering function. If a data set $(X, d)$ is $(c, \delta)$-uniquely optimal, then it is also $(\varepsilon, \delta)$-additive perturbation robust for all $\varepsilon$ below a bound determined by $c$ and the data.

Proof.

Consider a data set $(X, d)$, and let $d'$ be any $\varepsilon$-additive perturbation of $d$.

First, we argue that is close to . Let First, note that . This is because finds the optimal solution on , and so the clustering it selects can only have lower cost than the cost of . We also compute the cost of each element to the same cluster centers as in , as it provides an upper bound on . So, So, , so . ∎

Theorem 11.

Given the min-sum objective, $(c, \delta)$-uniquely optimal data sets are $(\varepsilon, \delta)$-additive perturbation robust for all $\varepsilon$ below a bound determined by $c$ and the data.

Proof.

Let be the optimal -means clustering of an an arbitrary data set.

Let be any -perturbation of . Therefore is at most:

Because , . Therefore the maximum change an -perturbation can produce approaches . ∎

Definition 14 ($\alpha$-Strictly Additive Separable).

A data set $(X, d)$ is $\alpha$-Strictly Additive Separable if there exists a unique clustering $C$ of $X$ so that for all $x \sim_C y$ and all $z$ with $x \nsim_C z$, $d(x, y) + \alpha < d(x, z)$.

Theorem 12.

Let $F$ be any one of $k$-means, $k$-medoids, or min-sum. Then for any $\alpha$ and any $\varepsilon > 0$, there exists an $\alpha$-strictly additive separable data set on which $F$ is not $\varepsilon$-additive perturbation robust.

Proof.

With respect to -additive perturbations:

Taking the same base data set as before we now set the cloud to be sufficiently dense, and the singleton points to be organized so that merges all but the large cluster, and as such the final clustering depends nearly entirely on the internal structure of the large cluster.

Arrange the data set so that its dissimilarities are small enough that an $\varepsilon$-additive perturbation of $d$ can radically alter the output of $F$. In particular, we can arrange the large cluster so that $F$ subdivides it into equal-size groups, and after the perturbation all points in that cluster form a single cluster by moving the points in that cluster very close together.

We now compute the distance between these two clusterings. The original clustering, restricted to the data in the large cluster, has between-cluster edges, all of which become in-cluster edges after the perturbation. So, the two clusterings differ by at least