1 Introduction
Figure 1: Uniform-density data: (a) K-means, (b) kernel K-means. Nonuniform data: (c) kernel K-means (Breiman’s bias, mode isolation), (d) kernel clustering (adaptive weights or kernels).
In machine learning, kernel clustering is a well-established data analysis technique [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] that can identify nonlinearly separable structures, see Figure 1(a,b). Section 1.1 reviews kernel K-means and related clustering objectives, some of which have theoretically explained biases, see Section 1.2. In particular, Section 1.2.2 describes the discrete Gini clustering criterion standard in decision tree learning, for which Breiman [11] proved a bias to histogram mode isolation.
Empirically, it is well known that kernel K-means or average association (see Section 1.1.1) has a bias to so-called “tight” clusters for small bandwidths [3]. Figure 1(c) demonstrates this bias on a nonuniform modification of a typical toy example for kernel K-means with the common Gaussian kernel
(1) $k(x, y) \propto \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
This paper shows in Section 2 that under certain conditions kernel Kmeans approximates the continuous generalization of the Gini criterion where we formally prove a mode isolation bias similar to the discrete case analyzed by Breiman. Thus, we refer to the “tight” clusters in kernel Kmeans as Breiman’s bias.
We propose a density equalization principle directly addressing the cause of Breiman’s bias. First, Section 3 discusses modification of the density with adaptive point weights. Then, Section 4 shows that a general class of locally adaptive geodesic kernels [10] implicitly transforms data and modifies its density. We derive “density laws” relating adaptive weights and kernels to density transformations. These laws make it possible to implement density equalization resolving Breiman’s bias, see Figure 1(d). One popular heuristic [12] approximates a special case of our Riemannian kernels.
Besides mode isolation, kernel clustering may have the opposite density bias, e.g. sparse subsets in Normalized Cut [3], see Figure 9(a). Section 5 presents “normalization” as implicit density inversion, establishing a formal relation between sparse subsets and Breiman’s bias. Equalization addresses any density biases. Interestingly, density equalization makes many standard kernel clustering criteria conceptually equivalent, see Section 6.
Figure 2: Example of Breiman’s bias on real data; panels: (a) Breiman’s bias, (b) good clustering. Feature vectors are 3-dimensional LAB colours corresponding to image pixels. Clustering results are shown in two ways. First, red and blue show different clusters inside LAB space. Second, pixels with colours in the “background” (red) cluster are removed from the original image. (a) shows the result for kernel K-means with a fixed-width Gaussian kernel isolating a small dense group of pixels from the rest. (b) shows the result for an adaptive kernel, see Section 4.
1.1 Kernel K-means
A popular data clustering technique, kernel K-means [1] is a generalization of the basic K-means method. Assuming $\Omega$ denotes a finite set of points and $f_p \in \mathbb{R}^N$ is a feature (vector) for point $p$, the basic K-means minimizes the sum of squared errors within clusters, that is, distances from points $f_p$ in each cluster $S^k \subset \Omega$ to the cluster means $m_k$
(2) $\sum_k \sum_{p \in S^k} \|f_p - m_k\|^2$
Instead of clustering data points in their original space, kernel K-means uses a mapping $\phi: \mathbb{R}^N \to \mathcal{H}$ embedding input data $f_p \in \mathbb{R}^N$ as points $\phi_p \equiv \phi(f_p)$ in a higher-dimensional Hilbert space $\mathcal{H}$. Kernel K-means minimizes the sum of squared errors in the embedding space corresponding to the following (mixed) objective function
(3) $F(S, m) = \sum_k \sum_{p \in S^k} \|\phi_p - m_k\|^2$
where $S = (S^1, S^2, \dots, S^K)$ is a partitioning (clustering) of $\Omega$ into $K$ clusters, $m = (m_1, m_2, \dots, m_K)$ is a set of parameters for the clusters, and $\|\cdot\|$ denotes the Hilbertian norm (footnote 1: our later examples use finite-dimensional embeddings $\phi$ where $\mathcal{H} = \mathbb{R}^M$ is a Euclidean space ($M \gg N$) and $\|\cdot\|$ is the Euclidean norm). Kernel K-means finds clusters separated by hyperplanes in $\mathcal{H}$. In general, these hyperplanes correspond to nonlinear surfaces in the original input space $\mathbb{R}^N$. In contrast to (3), the standard K-means objective (2) is able to identify only linearly separable clusters in $\mathbb{R}^N$.
Optimizing $F$ with respect to the parameters $m$ yields closed-form solutions corresponding to the cluster means in the embedding space:
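(4) $m_k = \frac{1}{|S^k|} \sum_{q \in S^k} \phi_q$
where $|S^k|$ denotes the cardinality (number of points) of cluster $S^k$. Plugging optimal means (4) into objective (3) yields a high-order function, which depends solely on the partition variable $S$:
(5) $F(S) = \sum_k \sum_{p \in S^k} \left\| \phi_p - \frac{1}{|S^k|} \sum_{q \in S^k} \phi_q \right\|^2$
Expanding the Euclidean distances in (5), one can obtain an equivalent pairwise clustering criterion expressed solely in terms of inner products $\langle \phi_p, \phi_q \rangle$ in the embedding space $\mathcal{H}$:
(6) $F(S) \overset{c}{=} -\sum_k \frac{\sum_{p,q \in S^k} \langle \phi_p, \phi_q \rangle}{|S^k|}$
where $\overset{c}{=}$ means equality up to an additive constant. The inner product is often replaced with a kernel $k(x, y)$, a symmetric function:
(7) $k(x, y) := \langle \phi(x), \phi(y) \rangle$
Then, kernel K-means objective (5) can be presented as
(8) $F(S) \overset{c}{=} -\sum_k \frac{\sum_{p,q \in S^k} k(f_p, f_q)}{|S^k|}$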
Formulation (8) enables optimization in a high-dimensional space that uses only kernel computations and does not require computing the embedding. Given a kernel function, one can use kernel K-means without knowing the corresponding embedding. However, not every symmetric function corresponds to an inner product in some space. Mercer’s theorem [2] states that any positive semidefinite (p.s.d.) kernel function can be expressed as an inner product in a higher-dimensional space. While p.s.d. is a common assumption for kernels, pairwise clustering objective (8) is often extended beyond p.s.d. affinities. There are many other extensions of the kernel K-means criterion (8). Despite the connection to density modes made in our paper, kernel clustering has only a weak relation to mean-shift [13], e.g. see [14].
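The equivalence between the explicit squared-error form (5) and the pairwise kernel form (8) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the toy data, and the explicit embedding $\phi(x) = (x, x^2)$ are all assumptions chosen for illustration. Under these assumptions the two forms should differ only by the partition-independent constant $\sum_p k(f_p, f_p)$.

```python
import numpy as np


def kernel_kmeans_objective(K, labels):
    """Pairwise form: -sum_k sum_{p,q in S^k} K[p, q] / |S^k|."""
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        total -= K[np.ix_(idx, idx)].sum() / len(idx)
    return total


def embedded_sse(Phi, labels):
    """Explicit form: squared distances to cluster means in the embedding."""
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        pts = Phi[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total


# Hypothetical toy data with an explicit embedding phi(x) = (x, x^2).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
Phi = np.stack([x, x ** 2], axis=1)
K = Phi @ Phi.T  # kernel matrix of inner products

labels_a = np.array([0, 0, 0, 1, 1, 1])
labels_b = np.array([0, 1, 1, 1, 1, 0])

# The additive constant is the trace of K, independent of the partition.
const = np.trace(K)
gap_a = embedded_sse(Phi, labels_a) - kernel_kmeans_objective(K, labels_a)
gap_b = embedded_sse(Phi, labels_b) - kernel_kmeans_objective(K, labels_b)
```

Only the kernel matrix enters `kernel_kmeans_objective`, so the same function applies when the embedding is unknown.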
1.1.1 Related graph clustering criteria
The positive semidefinite kernel $k(f_p, f_q)$ in (8) can be replaced by an arbitrary pairwise similarity or affinity matrix $A = [A_{pq}]$. This yields the average association criterion, which is known in the context of graph clustering [3, 15, 7]:
(9) $-\sum_k \frac{\sum_{p,q \in S^k} A_{pq}}{|S^k|}$
The standard kernel K-means algorithm [7, 9] is not guaranteed to decrease (9) for an improper (non-p.s.d.) kernel $k := A$. However, [15] showed that dropping the p.s.d. assumption is not essential: for an arbitrary association $A$ there is a p.s.d. kernel $k$ such that objective (8) is equivalent to (9) up to a constant.
The authors of [3] experimentally observed that the average association (9) or kernel K-means (8) objectives have a bias to separate a small dense group of data points from the rest, e.g. see Figure 2.
Besides average association, there are other pairwise graph clustering criteria related to kernel K-means. Normalized cut is a common objective in the context of spectral clustering [3, 16]. It optimizes the following objective
(10) $-\sum_k \frac{\sum_{p,q \in S^k} A_{pq}}{\sum_{p \in S^k} d_p}$
where $d_p = \sum_{q \in \Omega} A_{pq}$. Note that for $d_p = 1$ equation (10) reduces to (9). It is known that the Normalized cut objective is equivalent to a weighted version of the kernel K-means criterion [17, 7].
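The relation between average association and the normalized objective can be illustrated with a short sketch. It assumes the normalized objective is written in its association form, with within-cluster affinity divided by the total degree of the cluster; the function names and the random affinity matrix are hypothetical.

```python
import numpy as np


def average_association(A, labels):
    """-sum_k (within-cluster affinity) / |S^k|."""
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        total -= A[np.ix_(idx, idx)].sum() / len(idx)
    return total


def normalized_association(A, labels, d=None):
    """-sum_k (within-cluster affinity) / (sum of d_p over the cluster).

    By default d_p is the node degree sum_q A[p, q].
    """
    labels = np.asarray(labels)
    d = A.sum(axis=1) if d is None else np.asarray(d, dtype=float)
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        total -= A[np.ix_(idx, idx)].sum() / d[idx].sum()
    return total


# Hypothetical symmetric non-negative affinity matrix.
rng = np.random.default_rng(0)
B = rng.random((6, 6))
A = (B + B.T) / 2.0
labels = np.array([0, 0, 1, 1, 1, 0])

aa = average_association(A, labels)
nc_unit = normalized_association(A, labels, d=np.ones(len(A)))
nc = normalized_association(A, labels)
```

With unit "degrees" the two objectives coincide, and with true degrees each cluster term lies between 0 and 1 for non-negative affinities.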
1.1.2 Probabilistic interpretation via kernel densities
Besides kernel clustering, kernels are also commonly used for probability density estimation. This section relates these two independent problems. The standard multivariate kernel density estimate, or Parzen density estimate, for the distribution of data points within cluster $S^k$ can be expressed as follows [18]:
(11) $P_\Sigma(x \mid S^k) := \frac{1}{|S^k|} \sum_{q \in S^k} k(x, f_q)$
with kernel $k$ having the form:
(12) $k(x, y) = |\Sigma|^{-1/2} \, \psi\left(\Sigma^{-1/2} (x - y)\right)$
where $\psi$ is a symmetric multivariate density and $\Sigma$ is a symmetric positive definite bandwidth matrix controlling the density estimator’s smoothness. One standard example is the Gaussian (normal) kernel (1) corresponding to
(13) $\psi(t) \propto \exp\left(-\frac{\|t\|^2}{2}\right)$
which is commonly used both in kernel density estimation [18] and kernel clustering [6, 3].
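As an illustration of the Parzen estimate with a Gaussian kernel, the sketch below assumes an isotropic bandwidth matrix $\Sigma = \sigma^2 I$ and a normalized Gaussian density. The function name, the sample points, and the bandwidth value are assumptions made for this example only.

```python
import numpy as np


def gaussian_parzen_density(x, points, sigma):
    """Parzen estimate at query rows x from sample rows `points`.

    Assumes an isotropic bandwidth matrix Sigma = sigma^2 * I.
    """
    x = np.atleast_2d(x)
    points = np.atleast_2d(points)
    n_dim = points.shape[1]
    sq = ((x[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * sigma ** 2) ** (n_dim / 2.0)
    return np.exp(-sq / (2.0 * sigma ** 2)).sum(axis=1) / (norm * len(points))


# Hypothetical 1D sample: three nearby points and one isolated point.
points = np.array([0.0, 0.1, 0.2, 3.0]).reshape(-1, 1)
grid = np.linspace(-3.0, 6.0, 2001).reshape(-1, 1)
density = gaussian_parzen_density(grid, points, sigma=0.3)

# Riemann-sum check that the estimate integrates to roughly one.
dx = grid[1, 0] - grid[0, 0]
mass = density.sum() * dx
dense_value = gaussian_parzen_density([[0.1]], points, sigma=0.3)[0]
sparse_value = gaussian_parzen_density([[3.0]], points, sigma=0.3)[0]
```

The estimate is higher near the group of three points than at the isolated point, which is the kind of nonuniformity the later sections are concerned with.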
The choice of bandwidth $\Sigma$ is crucial for accurate density estimation, while the choice of $\psi$ plays only a minor role [19]. There are numerous works regarding kernel selection for accurate density estimation using either fixed [20, 19, 21] or variable bandwidth [22]. For example, Scott’s rule of thumb is
(14) $\sqrt{\Sigma_{ii}} = \frac{r_i}{\sqrt[N+4]{n}}$
where $n$ is the number of points, and $r_i^2$ is the variance of the $i$-th feature that could be interpreted as the range or scale of the data. Scott’s rule gives optimal mean integrated squared error for normal data distribution, but in practice it works well in more general settings. In all cases the optimal bandwidth for sufficiently large datasets is a small fraction of the data range [23, 18]. For brevity, we use the adjective small to describe bandwidths providing accurate density estimation.
1.2 Other clustering criteria and their known biases
One of the goals of this paper is a theoretical explanation for the bias of kernel K-means with small bandwidths toward tight dense clusters, which we call Breiman’s bias, see Figs. 1-2. This bias was observed in the past only empirically. As discussed in Section 4.1, large bandwidth reduces kernel K-means to basic K-means where bias to equal cardinality clusters is known [24]. This section reviews other standard clustering objectives, entropy and Gini criteria, that have biases already well understood theoretically. In Section 2 we establish a connection between Gini clustering and kernel K-means in the case of small kernels. This connection allows theoretical analysis of Breiman’s bias in kernel K-means.
1.2.1 Probabilistic Kmeans and entropy criterion
Besides nonparametric kernel K-means clustering there are well-known parametric extensions of basic K-means (2) based on probability models. Probabilistic K-means [24] or model-based clustering [25] use some given likelihood functions $P(f_p \mid \theta_k)$ instead of distances $\|f_p - \theta_k\|^2$ in (2) as in clustering objective
(16) $-\sum_k \sum_{p \in S^k} \log P(f_p \mid \theta_k)$
Note that objective (16) reduces to basic K-means (2) for a Gaussian probability model $P(\cdot \mid \theta_k)$ with mean $\theta_k$ and a fixed scalar covariance matrix.
In probabilistic K-means (16) models can differ from Gaussians depending on a priori assumptions about the data in each cluster, e.g. gamma, Gibbs, or other distributions can be used. For more complex data, each cluster can be described by highly descriptive parametric models such as Gaussian mixtures (GMM). Instead of kernel density estimates in kernel K-means (15), probabilistic K-means (16) uses parametric distribution models. Another difference is the absence of the $\log$ in (15) compared to (16).
The analysis in [24] shows that in the case of a highly descriptive model $P$, e.g. GMM or histograms, (16) can be approximated by the standard entropy criterion for clustering:
(17) $\sum_k |S^k| \cdot H(S^k)$
where $H(S^k)$ is the entropy of the distribution of the data in $S^k$:
$H(S^k) := -\int P(x \mid S^k) \log P(x \mid S^k) \, dx$
The discrete version of the entropy criterion is widely used for learning binary decision trees in classification [11, 18, 26]. It is known that the entropy criterion above is biased toward equal size clusters [11, 24, 27].
1.2.2 Discrete Gini impurity and criterion
Both Gini and entropy clustering criteria are widely used in the context of decision trees [18, 26]. These criteria are used to decide the best split at a given node of a binary classification tree [28]. The Gini criterion can be written for clustering as
(18) $\sum_k |S^k| \cdot G(S^k)$
where $G(S^k)$ is the Gini impurity for the points in $S^k$. Assuming a discrete feature space $\mathcal{L}$ instead of $\mathbb{R}^N$, the Gini impurity is
(19) $G(S^k) := 1 - \sum_{l \in \mathcal{L}} P(l \mid S^k)^2$
where $P(\cdot \mid S^k)$ is the empirical probability (histogram) of discrete-valued features $f_p \in \mathcal{L}$ in cluster $S^k$.
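The mode isolation behaviour of the discrete Gini criterion can be observed by brute force on a small histogram. The sketch below is only an illustration under stated assumptions: all points sharing a feature value are assigned to the same cluster (as in a decision tree split on feature values), the number of clusters is two, and the histogram counts and function names are hypothetical.

```python
def gini_impurity(counts):
    """Gini impurity 1 - sum_l P(l)^2 for a list of histogram counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_criterion(clusters):
    """Sum over clusters of |S^k| * G(S^k); each cluster is a list of counts."""
    return sum(sum(c) * gini_impurity(c) for c in clusters)


def best_binary_split(counts):
    """Exhaustively search all two-cluster splits of histogram bins."""
    n_bins = len(counts)
    best = None
    # The last bin always stays in the second cluster, so that each
    # unordered split is visited exactly once.
    for mask in range(1, 2 ** (n_bins - 1)):
        first = [i for i in range(n_bins - 1) if (mask >> i) & 1]
        second = [i for i in range(n_bins) if i not in first]
        value = gini_criterion(
            [[counts[i] for i in first], [counts[i] for i in second]]
        )
        if best is None or value < best[0]:
            best = (value, first, second)
    return best


# Hypothetical histogram: bin 0 is the mode.
counts = [5, 3, 2, 1]
value, first, second = best_binary_split(counts)
```

For this histogram the minimizing split places the most frequent bin alone in one cluster and all remaining bins in the other.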
Similarly to the entropy, Gini impurity $G(S^k)$ can be viewed as a measure of sparsity or “peakedness” of the distribution for points in $S^k$. Note that (18) has a form similar to the entropy criterion in (17), except that the entropy is replaced by the Gini impurity. Breiman [11] analyzed the theoretical properties of the discrete Gini criterion (18) when $P(\cdot \mid S^k)$ are discrete histograms. He proved that its optimum isolates the histogram mode (Theorem 1).
2 Breiman’s bias (numerical features)
In this section we show that the kernel K-means objective reduces to a novel continuous Gini criterion under some general conditions on the kernel function, see Section 2.1. We formally prove in Section 2.2 that the optimum of the continuous Gini criterion isolates the data density mode. That is, we show that the biases discussed earlier, observed in the context of clustering [3] and decision tree learning [11], are the same phenomenon. Section 2.3 establishes a connection to maximum cliques [29] and dominant sets [8].
For further analysis we reformulate the problem of clustering a discrete set of points $\{f_p \mid p \in \Omega\}$, see Section 1.1, as a continuous domain clustering problem. Let $\mathbb{P}$ be a probability measure over domain $\mathbb{R}^N$ and $\rho$ be the corresponding continuous probability density function such that the discrete points $f_p$ could be treated as samples from this distribution. The clustering of the continuous domain will be described by an assignment function $s: \mathbb{R}^N \to \{1, 2, \dots, K\}$. Density $\rho$ implies conditional probability densities $\rho^s_k(x) := \rho(x \mid s(x) = k)$. Feature points $f_p$ in cluster $S^k$ could be interpreted as a sample from conditional density $\rho^s_k$.
Then, the continuous clustering problem is to find an assignment function optimizing a clustering criterion. For example, analogously to (18), we can define the continuous Gini clustering criterion
(20) $\sum_k w_k \cdot G(s, k)$
where $w_k$ is the probability to draw a point from the $k$-th cluster and
(21) $G(s, k) := 1 - \int \rho^s_k(x)^2 \, dx$
In the next section we show that kernel K-means energy (15) can be approximated by the continuous Gini clustering criterion (20) for small kernels.
2.1 Kernel Kmeans and continuous Gini criterion
To establish the connection between kernel clustering and the Gini criterion, let us first recall Monte Carlo estimation [24], which yields the following expectation-based approximation for a continuous function $g(x)$ and cluster $C \subset \Omega$:
(22) $\sum_{p \in C} g(f_p) \approx |C| \int g(x) \, \rho_C(x) \, dx$
where $\rho_C$ is the “true” continuous density of features in cluster $C$. Using (22) for $C = S^k$ and $g(x) = P_\Sigma(x \mid S^k)$, we can approximate the kernel density formulation in (15) by its expectation
(23) $F(S) \overset{c}{\approx} -\sum_k |S^k| \int P_\Sigma(x \mid S^k) \, \rho^s_k(x) \, dx$
Note that partition $S$ is determined by dataset $\Omega$ and assignment function $s$. We also assume
(24) $P_\Sigma(\cdot \mid S^k) \approx \rho^s_k(\cdot)$
This is essentially an assumption on kernel bandwidth. That is, we assume that the kernel bandwidth gives accurate density estimation. For brevity, we call such bandwidths small, see Section 1.1.2. Then (23) reduces to the approximation
(25) $F(S) \overset{c}{\approx} -\sum_k |S^k| \int \rho^s_k(x)^2 \, dx$
An additional application of Monte Carlo estimation, $|S^k| / |\Omega| \approx w_k$, allows replacing set cardinality $|S^k|$ by the probability $w_k$ of drawing a point from $S^k$. This results in the continuous Gini clustering criterion (20), which approximates (15) or (8) up to additive and positive multiplicative constants.
The next section proves that the continuous Gini criterion (20) has a bias similar to the one observed by Breiman in the discrete case.
2.2 Breiman’s bias in continuous Gini criterion
This section extends Theorem 1 to the continuous Gini criterion (20). Since Section 2.1 has already established a close relation between the continuous Gini criterion and kernel K-means for small bandwidth kernels, Breiman’s bias also applies to the latter. For simplicity, we focus on $K = 2$ as in Breiman’s Theorem 1.
Theorem 2 (Breiman’s bias in continuous case).
For $K = 2$ the continuous Gini clustering criterion (20) achieves its optimal value at the partitioning of $\mathbb{R}^N$ into regions
$s_1 = \arg\max_x \rho(x)$ and $s_2 = \mathbb{R}^N \setminus s_1$.
Proof.
The statement follows from Lemma 2 below. ∎
We denote mathematical expectation of function
Minimization of (20) corresponds to maximization of the following objective function
(26) 
where the probability to draw a point from cluster is
where is the indicator function. Note that mixed joint density
allows to write conditional density in (26) as
(27) 
(28) 
Introducing notation
allows to further rewrite objective function as
(29) 
Without loss of generality assume that (the opposite case would yield a similar result). We now need following
Lemma 1.
Let be some positive numbers, then
Proof.
Use reduction to a common denominator. ∎
Lemma 2.
Assume that function is
(31) 
Then
(32) 
Proof.
Due to monotonicity of expectation we have
(33) 
(34) 
That is, the right part of (32) is an upper bound for .
This result states that the optimal assignment function separates the mode of the density function from the rest of the data. The proof considers the case $K = 2$ for the continuous Gini criterion approximating kernel K-means for small kernels. The multi-cluster version for $K > 2$ also has Breiman’s bias. Indeed, it is easy to show that any two clusters in the optimal solution must give an optimum of objective (20). Then, these two clusters are also subject to Breiman’s bias. See a multi-cluster example in Figure 3.
Practical considerations: While Theorem 2 suggests that the isolated density mode should be a single point, in practice Breiman’s bias in kernel K-means isolates a slightly wider cluster around the mode, see Figures 2, 3, 7(a-d), 8. Indeed, Breiman’s bias holds for kernel K-means when the assumptions in Section 2.1 are valid. In practice, shrinking of the clusters invalidates approximations (23) and (24), preventing the collapse of the clusters.
2.3 Connection to maximal cliques and dominant sets
Interestingly, there is also a relation between maximum cliques and density modes. Assume a $0$-$1$ kernel $[\|x - y\| \le \sigma]$ with bandwidth $\sigma$. Then, the kernel matrix is a connectivity matrix corresponding to a disk graph. Intuitively, the maximum clique on this graph should be inside a disk with the largest number of points in it, which corresponds to the density mode.
Formally, mode isolation bias can be linked to both maximum clique and its weighted-graph generalization, dominant set [8]. It is known that maximum clique [29] and dominant set [8] solve a two-region clustering problem with energy
(39) 
corresponding to average association (9) for and . Under the same assumptions as above, Gini impurity (21) can be used as an approximation reducing objective (39) to
(40) 
Using (33) and (37) we can conclude that the optimum of (40) isolates the mode of density function . Thus, clustering minimizing (39) for small bandwidths also has Breiman’s bias. That is, for such bandwidths the concepts of maximum clique and dominant set for graphs correspond to the concept of mode isolation for data densities. Dominant sets for the examples in Figures 1(c), 2(a), and 7(d) would be similar to the shown modeisolating solutions.
Figure 3: Panels: (a) density; (b) Gaussian kernel, 2 clusters; (c) Gaussian kernel, 4 clusters; (d) KNN kernel, 4 clusters. The last fully connected layer of the neural network in [31] gives a 4096-dimensional feature vector for each image. We reduce the dimension to 5 via PCA. For visualization purposes, we obtain 3D embeddings via MDS [32]. (a) Kernel density estimates for data points are color-coded: darker points correspond to higher density. (b,c) The result of kernel K-means with the Gaussian kernel (1). Scott’s rule of thumb defines the bandwidth. Breiman’s bias causes poor clustering: a small cluster is formed in the densest part of the data in (b), and three clusters occupy few points within the densest regions while the fourth cluster contains 71% of the data in (c). The normalized mutual information (NMI) in (c) is 0.38. (d) Good clustering produced by the KNN kernel (Example 3) gives NMI of 0.90, which is slightly better than the basic K-means (0.89).
3 Adaptive weights solving Breiman’s bias
We can use a simple modification of average association by introducing weights $w_p \ge 0$ for each point “error” within the equivalent kernel K-means objective (3)
(41) $F_w(S, m) = \sum_k \sum_{p \in S^k} w_p \|\phi_p - m_k\|^2$
Such weighting is common for K-means [23]. Similarly to Section 1.1, we can expand the Euclidean distances in (41) to obtain an equivalent weighted average association criterion generalizing (9)
(42) $-\sum_k \frac{\sum_{p,q \in S^k} w_p w_q A_{pq}}{\sum_{p \in S^k} w_p}$
Weights $w_p$ have an obvious interpretation based on (41); they change the data by replicating each point $p$ by a number of points in the same location (Figure 4a) in proportion to $w_p$. Therefore, this weighted formulation directly modifies the data density as
(43) $\rho'_p \propto w_p \, \rho_p$
where $\rho_p$ and $\rho'_p$ are respectively the densities of the original and the new (replicated) points. The choice of $w_p = 1 / \rho_p$ is a simple way of equalizing data density to solve Breiman’s bias. As shown in Figure 4(a), such a choice enables low-density points to be replicated more frequently than high-density ones. This is one of the density equalization approaches giving the solution in Figure 1(d).
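A density-equalizing choice of weights can be sketched as follows. This is an illustration under assumptions rather than the method used for the reported results: the density is estimated with an unnormalized Gaussian Parzen sum at a hypothetical bandwidth, the weights are taken inversely proportional to that estimate, and the weighted objective is written in the standard weighted kernel K-means form. All names are hypothetical.

```python
import numpy as np


def gaussian_kernel_matrix(X, sigma):
    """Gaussian kernel matrix exp(-||x - y||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))


def equalizing_weights(X, sigma):
    """Weights inversely proportional to an (unnormalized) Parzen density."""
    density = gaussian_kernel_matrix(X, sigma).mean(axis=1)
    return 1.0 / density


def weighted_average_association(A, w, labels):
    """-sum_k (sum_{p,q in S^k} w_p w_q A[p, q]) / (sum_{p in S^k} w_p)."""
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        wk = w[idx]
        total -= wk @ A[np.ix_(idx, idx)] @ wk / wk.sum()
    return total


# Hypothetical data: four nearby points and one isolated point.
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])
sigma = 0.5
A = gaussian_kernel_matrix(X, sigma)
w = equalizing_weights(X, sigma)
labels = np.array([0, 0, 1, 1, 1])

unit = weighted_average_association(A, np.ones(len(X)), labels)
weighted = weighted_average_association(A, w, labels)
```

The isolated point receives the largest weight, and with unit weights the weighted objective falls back to plain average association.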
Figure 4: (a) adaptive weights (Sec. 3): original data and replicated data; (b) adaptive kernels (Sec. 4.3): original data and transformed data.
4 Adaptive kernels solving Breiman’s bias
Breiman’s bias in kernel Kmeans is specific to small bandwidths. Thus, it has direct implications for the bandwidth selection problem discussed in this section. Note that kernel bandwidth selection for clustering should not be confused with kernel bandwidth selection for density estimation, an entirely different problem outlined in Section 1.1.2. In fact, small bandwidths give accurate density estimation, but yield poor clustering due to Breiman’s bias. Larger bandwidths can avoid this bias in clustering. However, Section 4.1 shows that for extremely large bandwidths kernel Kmeans reduces to standard Kmeans, which loses ability of nonlinear cluster separation and has a different bias to equal cardinality clusters [24, 27].
In practice, avoiding extreme bandwidths is problematic since the notions of small and large strongly depend on data properties that may significantly vary across the domain, e.g. in Figure 1c,d where no fixed bandwidth gives a reasonable separation. This motivates locally adaptive strategies. Interestingly, Section 4.2 shows that any locally adaptive bandwidth strategy implicitly corresponds to some data embedding deforming density of the points. That is, locally adaptive selection of bandwidth is equivalent to selection of density transformation. Local kernel bandwidth and transformed density are related via the density law established in (59). As we already know from Theorem 2, Breiman’s bias is caused by high nonuniformity of the data, which can be addressed by density equalizing transformations. Section 4.3 proposes adaptive kernel strategies based on our density law and motivated by a density equalization principle addressing Breiman’s bias. In fact, a popular locally adaptive kernel in [12] is a special case of our density equalization principle.
4.1 Overview of extreme bandwidth cases
Section 2.1 and Theorem 2 prove that for small bandwidths the kernel Kmeans is biased toward “tight” clusters, as illustrated in Figures 1, 2 and 7(d). As bandwidth increases, continuous kernel density (11) no longer approximates the true distribution violating (24). Thus, Gini criterion (25) is no longer valid as an approximation for kernel Kmeans objective (15). In practice, Breiman’s bias disappears gradually as bandwidth gets larger. This is also consistent with experimental comparison of smaller and larger bandwidths in [3].
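The other extreme case of bandwidth for kernel K-means comes from its reduction to basic K-means for large kernels. For simplicity, assume Gaussian kernels (1) of large bandwidth $\sigma$ approaching the data diameter. Then the kernel can be approximated by its Taylor expansion $\exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \approx 1 - \frac{\|x - y\|^2}{2\sigma^2}$ and kernel K-means objective (8) for $\sigma \gg \|x - y\|$ becomes (up to a constant)
(44) $\sum_k \frac{\sum_{p,q \in S^k} \|f_p - f_q\|^2}{2\sigma^2 |S^k|} = \frac{1}{\sigma^2} \sum_k \sum_{p \in S^k} \|f_p - m_k\|^2$
(footnote 2: relation (44) easily follows by substituting $m_k \equiv \frac{1}{|S^k|} \sum_{p \in S^k} f_p$), which is equivalent to basic K-means (2) for any fixed $\sigma$.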
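The large-bandwidth reduction can be checked numerically. The sketch below compares two partitions of a small hypothetical dataset: for a Gaussian kernel with a bandwidth much larger than the data diameter, the difference of kernel K-means objectives should be close to the difference of basic K-means squared errors divided by $\sigma^2$. The data, the two labelings, and the value of `sigma` are assumptions for illustration.

```python
import numpy as np


def gaussian_kernel_matrix(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))


def kernel_kmeans_objective(K, labels):
    """Pairwise kernel K-means objective (up to an additive constant)."""
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        total -= K[np.ix_(idx, idx)].sum() / len(idx)
    return total


def kmeans_sse(X, labels):
    """Basic K-means sum of squared errors to cluster means."""
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total


# Two small blobs; labels_a follows the blobs, labels_b mixes them.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
labels_a = np.array([0, 0, 0, 1, 1, 1])
labels_b = np.array([0, 1, 0, 1, 0, 1])

sigma = 100.0  # much larger than the data diameter
K = gaussian_kernel_matrix(X, sigma)
kernel_gap = (kernel_kmeans_objective(K, labels_a)
              - kernel_kmeans_objective(K, labels_b))
sse_gap = (kmeans_sse(X, labels_a) - kmeans_sse(X, labels_b)) / sigma ** 2
```

Both gaps are negative and nearly equal, so at this bandwidth the kernel objective ranks the two partitions exactly as basic K-means does.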
Figure 5 summarizes kernel Kmeans biases for different bandwidths. For large bandwidths the kernel Kmeans loses its ability to find nonlinear cluster separation due to reduction to the basic Kmeans. Moreover, it inherits the bias to equal cardinality clusters, which is wellknown for the basic Kmeans [24, 27]. On the other hand, for small bandwidths kernel Kmeans has Breiman’s bias proven in Section 2. To avoid the biases in Figure 5, kernel Kmeans should use a bandwidth neither too small nor too large. This motivates locally adaptive bandwidths.
4.2 Adaptive kernels as density transformation
This section shows that kernel clustering (8) with any locally adaptive bandwidth strategy satisfying some reasonable assumptions is equivalent to fixed bandwidth kernel clustering in a new feature space (Theorem 3) with a deformed point density. The adaptive bandwidths relate to density transformations via density law (59). To derive it, we interpret adaptiveness as nonuniform variation of distances across the feature space. In particular, we use a general concept of geodesic kernel
defining adaptiveness via a metric tensor and illustrate it by simple practical examples.
Our analysis of Breiman’s bias in Section 2 applies to general kernels (12) suitable for density estimation. Here we focus on clustering with kernels based on radial basis functions s.t.
(45) 
To obtain adaptive kernels, we replace Euclidean metric with Riemannian inside (45). In particular, is replaced with geodesic distances between features based on any given metric tensor for . This allows to define a geodesic or Riemannian kernel at any points and as in [10]
(46) 
where is introduced for shortness.
In practice, the metric tensor can be defined only at the data points for . Often, quickly decaying radial basis functions allow Mahalanobis distance approximation inside (46)
(47) 
which is normally valid only in a small neighborhood of . If necessary, one can use more accurate approximations for based on Dijkstra [33] or Fast Marching method [34].
Example 1 (Adaptive non-normalized Gaussian kernel; see footnote 3).
(Footnote 3: Lack of normalization as in (48) is critical for density equalization resolving Breiman’s bias, which is our only goal for adaptive kernels. Note that without kernel normalization as in (12), the Parzen density formulation of kernel K-means (15) no longer holds, invalidating the relation to Gini and Breiman’s bias in Section 2. On the contrary, normalized variable kernels are appropriate for density estimation [22], validating (15). They can also make approximation (24) more accurate, strengthening connections to Gini and Breiman’s bias.)
Mahalanobis distances based on (adaptive) bandwidth matrices defined at each point can be used to define adaptive kernel
(48) 
which equals fixed bandwidth Gaussian kernel (1) for . Kernel (48) approximates (46) for exponential function in (13) and tensor continuously extending matrices over the whole feature space so that for . Indeed, assuming matrices and tensor change slowly between points within bandwidth neighbourhoods, one can use (47) for all points in
(49) 
due to exponential decay outside the bandwidth neighbourhoods.
Example 2 (Zelnik-Manor & Perona kernel [12]).
This popular kernel is defined as $k_{pq} := \exp\left(-\frac{\|f_p - f_q\|^2}{2\sigma_p \sigma_q}\right)$. This kernel’s relation to (46) is less intuitive due to the lack of a “local” Riemannian tensor. However, under assumptions similar to those in (49), it can still be seen as an approximation of geodesic kernel (46) for some tensor $g$ such that $g_p = \sigma_p^{-2} I$ for $p \in \Omega$. They use the heuristic $\sigma_p = R^K_p$, which is the distance to the $K$-th nearest neighbour of $f_p$.
Example 3 (KNN kernel).
This adaptive kernel is defined as $k_{pq} = [f_p \in KNN(f_q)] + [f_q \in KNN(f_p)]$ where $KNN(f_p)$ is the set of $K$ nearest neighbors of $f_p$. This kernel approximates (46) for the uniform function $\psi(t) = [t < 1]$ and tensor $g$ such that $g_p = I / (R^K_p)^2$.
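The two locally adaptive kernels of Examples 2 and 3 can be sketched in a few lines. The code below is an illustrative sketch under assumptions: the local scale is taken as the distance to the K-th nearest neighbour, the self-tuning kernel uses the product of the two local scales with a factor of 2 in the denominator (the exact constant is an assumption), data points are assumed distinct, and all function names are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(X):
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)


def knn_radius(X, n_neighbors):
    """Distance from each point to its n_neighbors-th nearest neighbour.

    Column 0 of the sorted distances is the point itself (distance 0),
    assuming all points are distinct.
    """
    d = np.sqrt(pairwise_sq_dists(X))
    return np.sort(d, axis=1)[:, n_neighbors]


def self_tuning_kernel(X, n_neighbors):
    """Gaussian-type kernel scaled by the product of local radii."""
    sigma = knn_radius(X, n_neighbors)
    return np.exp(-pairwise_sq_dists(X) / (2.0 * np.outer(sigma, sigma)))


def knn_kernel(X, n_neighbors):
    """k[p, q] = [q in KNN(p)] + [p in KNN(q)], with values in {0, 1, 2}."""
    order = np.argsort(pairwise_sq_dists(X), axis=1)[:, 1:n_neighbors + 1]
    M = np.zeros((len(X), len(X)))
    M[np.arange(len(X))[:, None], order] = 1.0
    return M + M.T


# Hypothetical 1D data without distance ties.
X = np.array([[0.0], [1.0], [2.5], [10.0]])
radii = knn_radius(X, 1)
K_st = self_tuning_kernel(X, 1)
K_knn = knn_kernel(X, 1)
```

The KNN kernel is sparse by construction, while the self-tuning kernel is dense but decays on a scale set by the local neighbourhood size of both points.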
Figure 6: (a) space of points with Riemannian metric (unit balls in Riemannian metric); (b) transformed points with Euclidean metric (unit balls in Euclidean metric).
Theorem 3.
Proof.
A powerful general result in [35, 36, 15] states that for any symmetric matrix with zeros on the diagonal there is a constant such that squared distances
(50) 
form Euclidean matrix . That is, there exists some Euclidean embedding where for there corresponds a point such that , see Figure 6. Therefore,
(51) 
for and . ∎
Theorem 3 proves that adaptive kernels for can be equivalently replaced by a fixed bandwidth kernel for some implicit embedding^{4}^{4}4The implicit embedding implied by Euclidean matrix (50) should not be confused with embedding in the Mercer’s theorem for kernel methods. in a new space. Below we establish a relation between three local properties at point : adaptive bandwidth represented by matrix and two densities and in the original and the new feature spaces. For consider an ellipsoid in the original space , see Figure 6(a),
(52) 
Assuming is small enough so that approximation (47) holds, ellipsoid (52) covers features for subset of points
(53) 
Similarly, consider a ball in the new space , see Figure 6(b),
(54) 
covering features for points
(55) 
It is easy to see that (50) implies . Let and be the densities^{5}^{5}5We use the physical rather than probability density. They differ by a factor. of points within and correspondingly. Assuming denotes volumes or cardinalities of sets, we have
(56) 
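Omitting a constant factor depending on $\varepsilon$, $N$, and $N'$ we get
(57) $\rho'_p \propto \rho_p \, |\det g_p|^{-1/2}$
representing the general form of the density law. For the basic isotropic metric tensor such that $g_p = I / \sigma_p^2$ it simplifies to
(58) $\rho'_p \propto \rho_p \, \sigma_p^N$
Thus, bandwidth can be selected adaptively based on any desired transformation of density $\rho'_p = \tau(\rho_p)$ using
(59) $\sigma_p \propto \sqrt[N]{\tau(\rho_p) / \rho_p}$
where the observed density $\rho_p$ in the original feature space can be evaluated at any point $f_p$ using any standard estimator, e.g. (11).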
4.3 Density equalizing locally adaptive kernels
Bandwidth formula (59) works for any density transform . To address Breiman’s bias, one can use density equalizing transforms or , which even up
the highly dense parts of the feature space as illustrated on the right. Some empirical results using density equalization for synthetic and real data are shown in Figures 1(d) and 7(e,f).
Figure 7: Using fixed-width kernel: (a) input image, (b) 2D color histogram, (c) clustering result, (d) color-coded result. Using adaptive kernel: (e) density mapping, (f) clustering result.
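One way to estimate the density in (59) is the K-nearest neighbor (KNN) approach [18]
(60) $\rho_p \approx \frac{K}{n V_K} \propto \frac{K}{n (R^K_p)^N}$
where $n \equiv |\Omega|$ is the size of the dataset, $R^K_p$ is the distance to the $K$-th nearest neighbor of $f_p$, and $V_K$ is the volume of a ball of radius $R^K_p$ centered at $f_p$. Then, density law (59) for $\tau(\rho) = \mathrm{const}$ gives
(61) $\sigma_p \propto R^K_p$
The result in Figure 1(d) uses adaptive Gaussian kernel (48) for $\Sigma_p = \sigma_p^2 I$ with $\sigma_p$ derived in (61). Theorem 3 claims equivalence to a fixed bandwidth kernel in some transformed higher-dimensional space $\mathbb{R}^{N'}$. Bandwidths (61) are chosen specifically to equalize the data density in this space so that $\rho'_p = \mathrm{const}$.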
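The bandwidth selection described above can be sketched as follows. The code assumes the isotropic case, a K-nearest-neighbour density estimate known only up to a constant factor, and an unnormalized adaptive Gaussian kernel in which row $p$ uses its own bandwidth; the function names and the toy data are hypothetical. Under these assumptions the product of the estimated density and the bandwidth raised to the feature dimension is the same for every point, which is the equalization condition.

```python
import numpy as np


def knn_bandwidths(X, n_neighbors):
    """Bandwidth per point: distance to the n_neighbors-th nearest neighbour."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    return np.sort(d, axis=1)[:, n_neighbors]


def knn_density(X, n_neighbors):
    """KNN density estimate up to a constant: K / (n * R_p^N)."""
    n, dim = X.shape
    return n_neighbors / (n * knn_bandwidths(X, n_neighbors) ** dim)


def adaptive_gaussian_kernel(X, sigma):
    """Unnormalized Gaussian kernel where row p uses bandwidth sigma[p].

    The resulting matrix is in general not symmetric.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma[:, None] ** 2))


# Hypothetical 2D data: a tight group and two spread-out points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [3.0, 3.0], [5.0, 1.0]])
n_neighbors = 2
sigma = knn_bandwidths(X, n_neighbors)
rho = knn_density(X, n_neighbors)
K = adaptive_gaussian_kernel(X, sigma)

# Transformed density from the isotropic density law, up to a constant.
rho_new = rho * sigma ** X.shape[1]
```

Points in sparse regions receive larger bandwidths, so `rho_new` comes out constant across the dataset even though `rho` varies by orders of magnitude.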
The picture on the right illustrates such density equalization for the data in Figure 1(d). It shows a 3D projection of the transformed data obtained by multidimensional scaling [32] for matrix in (50). The observed density equalization removes Breiman’s bias from the clustering in Figure 1(d).
Real data experiments for kernels with adaptive bandwidth (61) are reported in Figures 2, 3, 7, 8 and Table I. Figure 7(e) illustrates the empirical density equalization effect for this bandwidth. Such data homogenization removes the conditions leading to Breiman’s bias, see Theorem 2. Also, we observe empirically that the KNN kernel is competitive with adaptive Gaussian kernels, while its sparsity gives efficiency and simplicity of implementation.
regularization | average error, %
none | 20.4 | 17.6 | 12.2 | 12.4
Euclidean length | 15.1 | 16.0 | 10.2 | 11.0
contrast-sensitive | 9.7 | 13.8 | 7.1 | 7.8
5 Normalized Cut and Breiman’s bias
Breiman’s bias for kernel Kmeans criterion (8), a.k.a. average association (AA) (9), was empirically identified in [3], but our Theorem 2 is its first theoretical explanation. This bias was the main critique against AA in [3]. They also criticize graph cut [40] that “favors cutting small sets of isolated nodes”. These critiques are used to motivate normalized cut (NC) criterion (10) aiming at balanced clustering without “clumping” or “splitting”.
We do not observe any evidence of the mode isolation bias in NC. However, Section 5.1 demonstrates that NC still has a bias to isolating sparse subsets. Moreover, using the general density analysis approach introduced in Section 4.2, we also show in Section 5.2 that normalization implicitly corresponds to some density-inverting embedding of the data. Thus, mode isolation (Breiman’s bias) in this implicit embedding corresponds to the sparse subset bias of NC in the original data.
5.1 Sparse subset bias in Normalized Cut
The normalization in NC does not fully remove the bias to small isolated subsets and it is easy to find examples of “splitting” for weakly connected nodes, see Figure 9(a). The motivation argument for the NC objective below Fig.1 in [3] implicitly assumes similarity matrices with zero diagonal, which excludes many common similarities like Gaussian kernel (1). Moreover, their argument is built specifically for an example with a single isolated point, while an isolated pair of points will have a nearzero NC cost even for zero diagonal similarities.
Intuitively, this NC issue can be interpreted as a bias to the “sparsest” subset (Figure 9a), the opposite of AA’s bias to the “densest” subset, i.e. Breiman’s bias (Figure 1c). The next subsection discusses the relation between these opposite biases in detail. In any case, both of these density inhomogeneity problems in NC and AA are directly addressed by our density equalization principle embodied in adaptive weights in Section 3 or in the locally adaptive kernels derived in Section 4.3. Indeed, the result in Figure 1(d) can be replicated with NC using such adaptive kernel. Interestingly, [12] observed another data nonhomogeneity problem in NC different from the sparse subset bias in Figure 9(a), but suggested a similar adaptive kernel as a heuristic solving it.
Figure 9: (a) NC for smaller bandwidth (bias to “sparsest” subsets); (b) NC for larger bandwidth (loss of nonlinear separation).
5.2 Normalization as density inversion
The bias to sparse clusters in NC with small bandwidths (Figure 9a) seems the opposite of mode isolation in AA (Figure 1c). Here we show that this observation is not a coincidence, since NC can be reduced to AA after some density-inverting data transformation. While it is known [17, 7] that NC is equivalent to weighted kernel K-means (i.e. weighted AA) with some modified affinity, this section relates such kernel modification to an implicit density-inverting embedding where mode isolation (Breiman’s bias) corresponds to sparse clusters in the original data.
First, consider standard weighted AA objective for any given affinity/kernel matrix as in (42)
Clearly, weights based on node degrees and “normalized” affinities