1 Introduction
As one of the most fundamental problems in machine learning, clustering has received a considerable amount of attention and has applications in data mining, computer vision, statistics, the social sciences, and beyond. Spectral graph cut algorithms such as normalized cuts [1], ratio cut [2], and ratio association [1, 3] are among the most studied and widely used classes of clustering methods. These algorithms cluster data by first constructing a similarity graph from the given data and then "cutting" the graph into groups of nodes according to a graph-theoretic objective. Normalized cuts has been widely used in the computer vision community for image segmentation [1] and other problems [4], while ratio cut has been applied in circuit layout [2]. Though these graph cut problems can be shown to be NP-hard, several effective algorithms have been proposed, including eigenvector-based approaches [1] as well as methods based on kernel k-means [3].
Despite the success of spectral graph cut algorithms, they suffer from several important limitations. For one, they require the number of clusters to be known before running the algorithm, but in many applications the number of clusters is not known a priori. More importantly, many graph cut objectives, such as the normalized cut and ratio cut objectives, favor clusters of equal size or degree, which typically leads these algorithms to produce clusters of nearly uniform size. Consider image segmentation, the canonical application of normalized cuts. As shown in [5], human-segmented images yield segments that are far from uniform; in fact, their segment sizes follow a power-law distribution. Power-law distributions arise frequently in a number of other clustering applications as well. For instance, because income follows a power-law distribution, attempting to cluster individuals into income brackets using census data would likely fail with standard clustering techniques. Other phenomena exhibiting power-law distributions include the populations of cities, the intensities of earthquakes, and the sizes of power outages [6]. These applications, and the lack of existing graph clustering methods that specifically encourage power-law cluster size structure, motivate our work.
In this paper, we propose a general framework of power-law graph cut algorithms that encourages cluster sizes to be power-law distributed and does not fix the number of clusters upfront. To achieve both goals, we borrow ideas from Bayesian nonparametrics [7], which provides a principled way to automatically infer both the parameters of a model and its complexity. We observe that the Pitman-Yor process [8], a Bayesian nonparametric prior that generalizes the Chinese restaurant process, yields clusters whose sizes follow a power-law distribution. We treat the Pitman-Yor exchangeable partition probability function (EPPF) [9] as a regularizer for graph cut objectives, so that the resulting objectives favor clusters that both have a small graph cut objective value and a power-law cluster size structure.
Algorithmically, incorporating the Pitman-Yor EPPF into existing cut formulations results in an optimization problem where standard spectral methods are no longer applicable. Inspired by the connection between spectral graph cut objectives and weighted kernel k-means [3], we derive a simple k-means-like iterative algorithm to optimize several power-law graph cut objectives. As with k-means, our proposed algorithm is guaranteed to converge to a local optimum in a finite number of steps. We further show that our graph cut problem may be viewed precisely as a MAP inference problem for a particular Pitman-Yor Gaussian mixture model. Finally, to demonstrate the utility of our algorithm, we perform extensive experiments using power-law normalized cuts on synthetic datasets, real-world data with power-law structure, and image segmentation.
Related Work:
Small-variance asymptotics have recently been extended to Bayesian nonparametric models to yield simple k-means-like algorithms [10, 11]; one application of that line of work is a normalized cut algorithm that does not fix the number of clusters upfront [10]. However, that approach cannot be directly applied to Pitman-Yor process mixture models, as small-variance asymptotics on the Pitman-Yor process model fail to capture any power-law characteristics. The most closely related work to ours is [12], an algorithm for scalable power-law clustering based on adapting k-means. Specifically, the authors propose a power-law data clustering algorithm based on modifying the Pitman-Yor process and performing a small-variance asymptotic analysis on the modified process. However, their objective function does not guarantee power-law distributed cluster sizes, and the optimal clustering solutions for their objective are often trivial. We discuss this method further in Sections 4.4 and 5. Finally, the work of [5] introduces a model for segmentation based on Pitman-Yor priors, but it is specific to the image domain, whereas our method is a general graph clustering algorithm.
2 Background
We begin with a brief discussion of spectral graph cut algorithms and their connection to weighted kernel k-means.
2.1 Spectral graph cut algorithms
In the graph clustering setting, we are given an undirected weighted graph $G = (\mathcal{V}, \mathcal{E})$, in which $\mathcal{V}$ denotes the set of $n$ vertices and $\mathcal{E}$ denotes the edges. The weight of an edge between two vertices represents their similarity. The corresponding adjacency matrix $A$ is an $n \times n$ matrix whose entry $A_{ij}$ represents the weight of the edge between vertices $i$ and $j$.
The idea behind graph cuts is to partition the graph into $k$ disjoint clusters $\mathcal{V}_1, \dots, \mathcal{V}_k$ such that the edges within a cluster have high weight and the edges between clusters have low weight. Several different graph cut objectives have been proposed [1, 2, 13], among which normalized cuts [1] and ratio cut [2] are two of the most popular. Denote
$$\mathrm{links}(\mathcal{V}_a, \mathcal{V}_b) = \sum_{i \in \mathcal{V}_a,\, j \in \mathcal{V}_b} A_{ij},$$
i.e., the sum of the edge weights between $\mathcal{V}_a$ and $\mathcal{V}_b$, and $\deg(\mathcal{V}_a) = \mathrm{links}(\mathcal{V}_a, \mathcal{V})$, the sum of all edge weights between $\mathcal{V}_a$ and $\mathcal{V}$. Normalized cuts (sometimes called $k$-way normalized cuts) aims to minimize the cut relative to the degree of each cluster. The objective can be expressed as
$$\mathrm{NCut}(\mathcal{V}_1, \dots, \mathcal{V}_k) = \sum_{c=1}^{k} \frac{\mathrm{links}(\mathcal{V}_c, \mathcal{V} \setminus \mathcal{V}_c)}{\deg(\mathcal{V}_c)}.$$
While minimizing this objective can be shown to be NP-complete, a relaxation of it can be globally optimized using spectral methods by computing the first $k$ eigenvectors of the normalized Laplacian constructed from the adjacency matrix [14].
The ratio cut objective differs from normalized cuts in that it normalizes the cut between each cluster and the remaining vertices by the cluster's size rather than its degree. It is expressed as
$$\mathrm{RCut}(\mathcal{V}_1, \dots, \mathcal{V}_k) = \sum_{c=1}^{k} \frac{\mathrm{links}(\mathcal{V}_c, \mathcal{V} \setminus \mathcal{V}_c)}{|\mathcal{V}_c|}.$$
Note that there are also other graph partitioning objectives that fall under this framework (see, e.g., Section 3 of [3] which generalizes association and cut problems to weighted variants), and our approach can also be applied to these objectives.
2.2 Weighted kernel k-means and graph cuts
Consider the k-means objective function with $k$ clusters $\pi_1, \dots, \pi_k$:
$$\sum_{c=1}^{k} \sum_{x_i \in \pi_c} \|x_i - m_c\|^2,$$
where $m_c = \frac{1}{|\pi_c|} \sum_{x_i \in \pi_c} x_i$. It is straightforward to extend this to the weighted setting by introducing a weight $w_i$ for each data point, which yields the following:
$$\sum_{c=1}^{k} \sum_{x_i \in \pi_c} w_i \|x_i - m_c\|^2,$$
where now the mean is the weighted mean $m_c = \sum_{x_i \in \pi_c} w_i x_i / \sum_{x_i \in \pi_c} w_i$. Further, we can replace the original data with mapped data $\phi(x_i)$ and treat the entire problem in kernel space by expressing both the k-means algorithm and the objective in terms of inner products. This is necessary for the connection to graph cuts.
Dhillon et al. [3] showed that there is a connection between the weighted kernel k-means objective and several spectral graph cut objectives. We discuss in particular the connection to normalized cuts. Define the degree matrix $D$ as the diagonal matrix whose entries $D_{ii} = \sum_j A_{ij}$ are equal to the degree of node $i$. The surprising fact established in [3] is that normalized cuts and weighted kernel k-means are mathematically equivalent, in the following sense: if $A$ is an adjacency matrix, then the normalized cuts objective on $A$ is equivalent to the weighted kernel k-means objective (plus a constant) on the kernel matrix $K = \sigma D^{-1} + D^{-1} A D^{-1}$, where $\sigma$ is chosen such that $K$ is a positive semidefinite matrix, and where the weights of the data points are equal to the degrees of the nodes. Thus, for the purposes of minimizing the weighted (kernel) k-means objective function, we can effectively interchange the objective with the normalized cut objective, i.e.,
$$\sum_{c=1}^{k} \sum_{x_i \in \pi_c} w_i \|\phi(x_i) - m_c\|^2 = \mathrm{NCut}(\mathcal{V}_1, \dots, \mathcal{V}_k) + \text{const} \qquad (1)$$
for the appropriate definition of the kernel matrix. In particular, this result gives an algorithm for monotonically minimizing the normalized cut objective: we form the appropriate kernel, set the weights to the degrees, and run weighted kernel k-means on that kernel matrix. Similar equivalences hold for both the ratio cut and ratio association objectives; by forming appropriate kernels and weights, these graph cut objectives can also be shown to be mathematically equivalent to the weighted kernel k-means objective.
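The kernel construction behind this equivalence can be sketched in a few lines. The function name and the small example graph below are ours, and the shift $\sigma$ must in practice be chosen large enough (e.g., validated per graph) to make the kernel positive semidefinite:

```python
import numpy as np

def normalized_cut_kernel(A, sigma):
    """Kernel matrix K = sigma * D^{-1} + D^{-1} A D^{-1} from [3].

    With point weights equal to node degrees, weighted kernel
    k-means on K matches the normalized cut objective on A up to
    a constant; sigma must be large enough that K is PSD.
    """
    deg = A.sum(axis=1)                  # node degrees
    Dinv = np.diag(1.0 / deg)            # D^{-1}
    K = sigma * Dinv + Dinv @ A @ Dinv
    return K, deg                        # kernel and k-means weights

# A small two-block example graph: K should come out symmetric,
# and PSD once sigma is large enough.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
K, w = normalized_cut_kernel(A, sigma=2.0)
```

For this example, $\sigma = 2$ already makes $K$ diagonally dominant and hence positive semidefinite.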
3 The Power-law Normalized Cut Objective
Our goal is to propose and study a new set of graph cut objectives that produce power-law distributed cluster sizes. To achieve this, we borrow key ideas from Bayesian nonparametrics. More specifically, we look at the Pitman-Yor process [8], a generalization of the Chinese restaurant process that yields power-law distributed cluster sizes. For simplicity, we focus on the normalized cut objective as an example; one can simply replace it with another graph cut objective to obtain other power-law graph cut algorithms in our framework.
3.1 Pitman-Yor EPPF
The canonical Bayesian nonparametric clustering prior is the Chinese restaurant process (CRP) [7]. It yields a distribution over clusterings in which the number of clusters is not fixed and the sizes of the clusters decay exponentially. The CRP is described as follows: customers enter a restaurant with an infinite number of tables (each table corresponds to a cluster). The first customer sits at the first table. Each subsequent customer sits at an existing table with probability proportional to the number of customers already seated there, and at a new table with probability proportional to a concentration parameter $\theta$. The Pitman-Yor process leads to an extension of the CRP in which the cluster sizes instead follow a power-law distribution. In this modified version of the CRP, a customer sits at an existing table with probability proportional to the number of existing occupants minus a discount parameter $d$ ($0 \leq d < 1$), and at a new table with probability proportional to $\theta + Kd$, where $K$ is the current number of occupied tables. Thus, as the number of tables increases, there is a higher probability of starting a new table; this leads to the heavier, power-law distribution of cluster sizes.
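The seating process just described is easy to simulate. The following sketch (our own function and parameter names, with concentration `theta` and discount `d`) makes the generalized CRP concrete:

```python
import random
from collections import Counter

def pitman_yor_crp(n, theta, d, seed=0):
    """Simulate n customers seating under the Pitman-Yor CRP.

    Existing table k is chosen with probability proportional to
    (size_k - d); a new table with probability proportional to
    (theta + d * K), K the current number of occupied tables.
    Returns a cluster label per customer and the table sizes.
    """
    rng = random.Random(seed)
    labels, sizes = [], []          # sizes[k] = occupants of table k
    for _ in range(n):
        K = len(sizes)
        # Unnormalized seating probabilities; index K means "new table".
        weights = [s - d for s in sizes] + [theta + d * K]
        r = rng.random() * sum(weights)
        acc, choice = 0.0, K
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                choice = k
                break
        if choice == K:             # start a new table
            sizes.append(1)
        else:
            sizes[choice] += 1
        labels.append(choice)
    return labels, sizes
```

Running this with $d > 0$ and plotting the sorted `sizes` exhibits the heavy-tailed cluster size profile described above.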
One can explicitly write down the probability of observing a particular seating arrangement under the Pitman-Yor CRP, and the resulting formula is known as the Pitman-Yor exchangeable partition probability function (EPPF) [9]. If we let $Z$ be an indicator matrix for the resulting clustering of $n$ points into $K$ clusters, then the probability distribution of $Z$ under the Pitman-Yor CRP is given by the following somewhat cumbersome form:
$$P(Z) = \frac{\prod_{i=1}^{K-1} (\theta + id)}{(\theta + 1)_{n-1}} \prod_{k=1}^{K} (1 - d)_{n_k - 1}, \qquad (2)$$
where $n_k$ is the size of cluster $k$, and the rising factorial is defined as $(x)_m = x(x+1)\cdots(x+m-1)$. One can verify that, when $d = 0$, we recover the original CRP probability distribution. One can also show that the expected number of clusters under this distribution grows as $O(n^d)$ for $d > 0$, and that we obtain the desired power-law cluster size distribution.
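For numerical work it is more convenient to evaluate the EPPF in log space via log-gamma functions. The following sketch (our own function name, using the rising-factorial identity $(x)_m = \Gamma(x+m)/\Gamma(x)$) computes the negative log of (2):

```python
from math import lgamma, log, exp

def neg_log_pitman_yor_eppf(sizes, theta, d):
    """Negative log of the Pitman-Yor EPPF, eq. (2), for cluster
    sizes n_1, ..., n_K, concentration theta and discount 0 <= d < 1.

    log P = sum_{i=1}^{K-1} log(theta + i*d)
          + sum_k [lgamma(n_k - d) - lgamma(1 - d)]   # (1-d)_{n_k-1}
          - [lgamma(theta + n) - lgamma(theta + 1)]   # (theta+1)_{n-1}
    """
    n, K = sum(sizes), len(sizes)
    logp = sum(log(theta + i * d) for i in range(1, K))
    logp += sum(lgamma(nk - d) - lgamma(1.0 - d) for nk in sizes)
    logp -= lgamma(theta + n) - lgamma(theta + 1.0)
    return -logp
```

As a sanity check, for two points the only partitions are $\{2\}$ and $\{1,1\}$, with probabilities $(1-d)/(\theta+1)$ and $(\theta+d)/(\theta+1)$, which sum to one; and setting $d = 0$ recovers the CRP probabilities.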
3.2 Power-law normalized cut objective
To obtain power-law distributed cluster sizes within a graph clustering setting, we treat the Pitman-Yor EPPF as a regularizer on the cluster indicator matrix $Z$ of normalized cuts. Our resulting objective is
$$\min_{Z} \; \mathrm{NCut}(Z) + \lambda\, \varphi(Z), \qquad (3)$$
where $Z$ indicates the cluster assignment of each node, $\varphi(Z)$ is the negative log of the Pitman-Yor EPPF (2), and $\lambda$ controls the tradeoff between the original graph cut objective and the regularization term. The first term is the standard normalized cut objective. A power-law distributed partition gives a high value of the Pitman-Yor EPPF and thus a low value of the second term. Therefore, the clustering that minimizes this objective should give a partition of the graph that both preserves similarity information and has power-law distributed cluster sizes.
4 Optimization
The objective function (3) defined in the previous section enforces a tradeoff between the standard normalized cut objective and a preference for power-law cluster size structure. We now turn to optimizing the resulting objective.
4.1 The vector case
The first observation we can make is that spectral methods will not apply to our proposed objective. Recall that for the normalized cut objective, a standard approach is to relax the cluster indicator matrix to be continuous, leading to a simple eigenvector problem that can be optimized globally. When applying such a technique to the power-law normalized cut objective, one would need to incorporate the regularization term appropriately into the trace maximization problem that emerges from the spectral solution, but this turns out to be impossible.
Instead we must turn to the other main optimization strategy for normalized cuts, namely the equivalence to weighted kernel k-means, and adapt the weighted kernel k-means algorithm to our problem. To start, in this section we derive a k-means-like algorithm for the following regularized weighted k-means problem:
$$\min_{Z} \; \sum_{c=1}^{K} \sum_{x_i \in \pi_c} w_i \|x_i - m_c\|^2 + \lambda\, \varphi(Z),$$
where $\varphi(Z)$ is the negative log of the Pitman-Yor EPPF and the means $m_c$ are the weighted means of the points in $\pi_c$, as in the standard weighted k-means of Section 2.2. Once we have the algorithm for this case, we can easily extend the connection between normalized cuts and weighted kernel k-means to obtain an algorithm with monotonic local convergence for the power-law normalized cut objective. Note that this treatment applies equally to the ratio cut and ratio association objectives.
We observe that, when the cluster indicators $Z$ are fixed, the weighted mean is justified in the above objective, since it is the best cluster representative for each cluster in terms of the objective function: for fixed $Z$ and any choice of cluster representatives, the regularizer is constant, and simple differentiation gives
$$m_c = \frac{\sum_{x_i \in \pi_c} w_i x_i}{\sum_{x_i \in \pi_c} w_i}.$$
Therefore, the updates on the means are exactly as in standard weighted k-means.
The other step is the update of the indicators $Z$. In standard k-means, these updates are derived by fixing the means and minimizing the k-means objective with respect to each assignment, which yields the usual k-means assignment step. The Pitman-Yor EPPF regularizer makes the assignment updates somewhat less trivial, but they are still fairly straightforward. For each data point, we consider the objective value obtained by assigning the point to each existing cluster, as well as to a new cluster, and assign it to the cluster that yields the smallest objective. The regularizer effectively adds a "correction" to each distance computation. Let $n_c$ be the number of points in cluster $c$. After going through the algebra, we arrive at the following: if a point is currently assigned to one cluster, then the distance to a cluster $c$ (ignoring constants, which do not affect reassignment) is given by the following cases:

- … if … and $c$ is an existing cluster,
- … if … and $c$ is an existing cluster,
- … if … and $c$ is a new cluster,
- … if … and $c$ is a new cluster.
Observe that the distance to new clusters decreases as the number of clusters $K$ increases, analogous to the property of the Pitman-Yor version of the Chinese restaurant process that starting a new table becomes more likely as the number of tables grows. Similarly, when computing the distance to existing clusters, the distance becomes smaller as the cluster gets larger (i.e., as its size $n_c$ grows), leading to the "rich get richer" behavior. Finally, whenever a new cluster is started by some point, we immediately set its mean to be that point. See Algorithm 1 for a full specification. Note that, analogous to the convergence proof of k-means, one can easily show that this algorithm monotonically decreases the regularized k-means objective until local convergence.
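The overall structure of the iterative algorithm can be sketched as follows. Rather than the closed-form case-by-case corrections above, this sketch scores each candidate assignment by recomputing the full regularized objective, including $\lambda$ times the negative log-EPPF; it is equivalent in the moves it selects but slower, and all function and parameter names (theta, d, lam) are ours:

```python
import numpy as np
from math import lgamma, log

def neg_log_eppf(sizes, theta, d):
    # -log of the Pitman-Yor EPPF for the given cluster sizes.
    n, K = sum(sizes), len(sizes)
    logp = sum(log(theta + i * d) for i in range(1, K))
    logp += sum(lgamma(nk - d) - lgamma(1.0 - d) for nk in sizes)
    logp -= lgamma(theta + n) - lgamma(theta + 1.0)
    return -logp

def powerlaw_weighted_kmeans(X, w, theta=1.0, d=0.5, lam=1.0, iters=20):
    """k-means-like local minimization of
      sum_c sum_{i in c} w_i ||x_i - m_c||^2 + lam * (-log PY-EPPF),
    with weighted-mean updates.  A sketch: each candidate move is
    scored by recomputing the whole objective (O(n K) per move)."""
    n = len(X)
    z = np.zeros(n, dtype=int)                 # start with one cluster
    for _ in range(iters):
        changed = False
        for i in range(n):
            sizes = np.bincount(z).tolist()
            best, best_cost = z[i], np.inf
            # Candidates: every existing cluster plus a new singleton.
            for c in range(len(sizes) + 1):
                znew = z.copy(); znew[i] = c
                # Relabel so any emptied cluster disappears.
                _, znew = np.unique(znew, return_inverse=True)
                cost = lam * neg_log_eppf(np.bincount(znew).tolist(), theta, d)
                for k in range(znew.max() + 1):
                    mask = znew == k
                    m = np.average(X[mask], axis=0, weights=w[mask])
                    cost += np.sum(w[mask] * np.sum((X[mask] - m) ** 2, axis=1))
                if cost < best_cost - 1e-12:   # strict improvement only
                    best_cost, best = cost, c
            if best != z[i]:
                z[i] = best; changed = True
            _, z = np.unique(z, return_inverse=True)
        if not changed:
            break
    return z
```

On two well-separated point masses, the sketch recovers the two-cluster partition without the number of clusters being fixed in advance.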
4.2 Power-law normalized cut algorithm
Recall that in Section 2.2 we discussed the equivalence between the graph cut formulation and the weighted kernel k-means objective, as in (1). With this equivalence in hand, the extension from the vector case to the power-law graph cut objectives follows easily: we simply replace the weighted k-means term with a graph cut term, which gives exactly the same objective as our power-law graph cut objective in (3) up to a constant; we then apply Algorithm 1 in kernel space to solve the resulting optimization problem. More specifically, given a graph with adjacency matrix $A$, our power-law normalized cut algorithm is described as follows:

1. Compute the degree matrix $D$ from $A$ as the diagonal matrix whose entries are equal to the degrees of the nodes.
2. Compute the kernel matrix $K = \sigma D^{-1} + D^{-1} A D^{-1}$ from $A$, with $\sigma$ chosen so that $K$ is positive semidefinite.
In kernel space, the regularized distance corrections remain unchanged. The only change is that we now need to compute $\|\phi(x_i) - m_c\|^2$ instead of $\|x_i - m_c\|^2$. Expanding this distance and using the formula $m_c = \sum_{x_j \in \pi_c} w_j \phi(x_j) / s_c$ for the weighted mean, with $s_c = \sum_{x_j \in \pi_c} w_j$, we obtain
$$\|\phi(x_i) - m_c\|^2 = \phi(x_i)^\top \phi(x_i) - \frac{2 \sum_{x_j \in \pi_c} w_j\, \phi(x_i)^\top \phi(x_j)}{s_c} + \frac{\sum_{x_j, x_l \in \pi_c} w_j w_l\, \phi(x_j)^\top \phi(x_l)}{s_c^2}.$$
Using the kernel matrix $K$, the above may be written as
$$K_{ii} - \frac{2 \sum_{x_j \in \pi_c} w_j K_{ij}}{s_c} + \frac{\sum_{x_j, x_l \in \pi_c} w_j w_l K_{jl}}{s_c^2}.$$
We note that, as when applying weighted kernel k-means to the standard normalized cut problem [3], each iteration of Algorithm 1 applied in kernel space requires time roughly linear in the number of nonzero entries of $A$, making it scalable to large graphs. Also note that, by using an appropriate kernel matrix $K$, we can utilize other graph cut objectives in this framework.
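The kernelized distance above can be sketched directly from the kernel matrix. The function name is ours; the natural correctness check is that, for a linear kernel $K = XX^\top$, it must agree with the explicit feature-space distance to the weighted mean:

```python
import numpy as np

def kernel_distance(K, w, idx, i):
    """Squared distance ||phi(x_i) - m_c||^2 in kernel space, where
    m_c is the weighted mean of the cluster members (index array
    idx) and K is the kernel matrix:
        K_ii - 2 * sum_j w_j K_ij / s_c
             + sum_{j,l} w_j w_l K_jl / s_c^2,   s_c = sum_j w_j.
    """
    wc = w[idx]
    s = wc.sum()
    return (K[i, i]
            - 2.0 * (wc * K[i, idx]).sum() / s
            + (wc[:, None] * wc[None, :] * K[np.ix_(idx, idx)]).sum() / s**2)
```

Because only entries of $K$ indexed by the cluster members are touched, a sparse graph kernel keeps each evaluation cheap.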

- If …: … if … and $c$ is an existing cluster; … if $c$ is a new cluster.
- If …: … if … and $c$ is an existing cluster; … if $c$ is a new cluster.
4.3 Connection to Pitman-Yor MAP inference
Finally, we briefly consider the connection between our proposed objective and a simple Pitman-Yor process mixture model. Consider the following Bayesian nonparametric generative model:
$$Z \sim \mathrm{PYCRP}(\theta, d), \qquad x_i \mid z_i = c \;\sim\; \mathcal{N}\!\left(\mu_c, \tfrac{\sigma^2}{w_i} I\right),$$
where PYCRP refers to the Pitman-Yor Chinese restaurant process. To perform MAP inference, we can write down the joint likelihood and maximize with respect to the relevant parameters; up to constants, this amounts to solving
$$\min_{Z, \{\mu_c\}} \; \sum_{c} \sum_{x_i \in \pi_c} w_i \|x_i - \mu_c\|^2 + \lambda\, \varphi(Z),$$
where $\varphi(Z)$ is the negative log of the Pitman-Yor EPPF and $\lambda = 2\sigma^2$. Note that the minimization with respect to $\mu_c$ yields precisely the weighted means, and so, based on the equivalence between weighted kernel k-means and normalized cuts, our proposed objective function may be viewed in a MAP inference framework. This framework also justifies the use of the negative log of the Pitman-Yor EPPF as a regularizer.
4.4 Comparison to an existing power-law clustering algorithm
In [12], the authors propose a different objective for power-law data clustering, which adds a cluster-count penalty term to the DP-means objective function of [10]. While this objective does incorporate the number of clusters into the optimization, it neither requires nor encourages the cluster sizes to follow a power-law distribution. Moreover, in their experiments the authors fix one of the penalty parameters, which simplifies the objective further. One can show that, once the number of data points is large enough, the trivial clustering, in which every data point is a singleton cluster, minimizes this objective: the trivial clustering drives the k-means term to its minimum of zero while also minimizing the regularization term. In short, this objective is not appropriate for power-law clustering applications. In the experiments section, we also compare our algorithm with their method empirically.
5 Experiments
We conclude with a brief set of experiments demonstrating the utility of our methods.
Namely, we show that our approach enjoys benefits over the k-means algorithm on real power-law datasets in the vector setting, and benefits over standard normalized cuts on synthetic and real data in the graph setting (normalized cut image segmentation code: http://www.cis.upenn.edu/~jshi/software/).
We also compare our method with the algorithm of [12] and show that our method achieves better clustering results.
Throughout the experiments, we use normalized mutual information (NMI) between the algorithm's clusters and the ground-truth clusters for evaluation.
Synthetic power-law graph data. We begin with a synthetic power-law random graph dataset generated by applying the Pitman-Yor process to the stochastic block model. Specifically, the Pitman-Yor CRP is first used to generate cluster assignments, and a standard stochastic block model then uses the assignments to generate a random graph. We create a dataset of … nodes with … disjoint clusters using the above process, with the corresponding adjacency matrix shown on the left of Figure 1. The Pitman-Yor concentration and discount parameters are … and …, respectively. In the stochastic block model, the entries of the block matrix are sampled from two Gaussian distributions: one for diagonal entries and the other for off-diagonal entries. Our power-law normalized cut algorithm is then applied to this dataset, with parameters validated on a separate validation dataset generated from the same process. We compare with normalized cuts with its number of clusters set to the ground truth. The results are shown in Figure 1; normalized cuts splits the big clusters, while our algorithm nearly recovers the ground-truth clusters.
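A generator in the spirit of this synthetic setup might look as follows. Since the exact Gaussian parameters of the block matrix are not reproduced here, this sketch uses fixed within- and between-cluster edge probabilities instead (all names and defaults are ours):

```python
import numpy as np

def py_crp_labels(n, theta, d, rng):
    # Pitman-Yor CRP cluster assignments (see Section 3.1).
    sizes, labels = [], []
    for _ in range(n):
        K = len(sizes)
        weights = np.array([s - d for s in sizes] + [theta + d * K])
        c = rng.choice(K + 1, p=weights / weights.sum())
        if c == K:
            sizes.append(1)
        else:
            sizes[c] += 1
        labels.append(c)
    return np.array(labels)

def power_law_sbm(n, theta, d, p_in, p_out, seed=0):
    """Random graph whose ground-truth clusters have Pitman-Yor
    (power-law) sizes: an edge appears with probability p_in
    within a cluster and p_out across clusters.  Constant edge
    probabilities stand in for the Gaussian-sampled block matrix
    used in the paper's experiment."""
    rng = np.random.default_rng(seed)
    z = py_crp_labels(n, theta, d, rng)
    same = z[:, None] == z[None, :]
    P = np.where(same, p_in, p_out)
    U = rng.random((n, n))
    A = np.triu((U < P).astype(float), 1)   # upper triangle, no self-loops
    return A + A.T, z
```

Sorting nodes by their labels and displaying the resulting adjacency matrix produces block structure with power-law block sizes, as in Figure 1.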

Table 1: NMI on UCI power-law datasets.

| Dataset | Ours | k-means | Method of [12] |
|---|---|---|---|
| audiology | 0.621 | 0.518 | 0.417 |
| ecoli | 0.700 | 0.545 | 0.608 |
| glass | 0.427 | 0.315 | 0.297 |
| hypothyroid | 0.024 | 0.009 | 0.077 |
| pageblocks | 0.209 | 0.123 | 0.088 |
| flags | 0.275 | 0.198 | 0.178 |
Real-world power-law data sets. Next we compare Algorithm 1 with k-means and the method of [12] on real-world benchmark data sets, to demonstrate that our algorithm performs best on clustering vector data when the cluster sizes are power-law distributed. We selected UCI classification datasets whose class labels are power-law distributed (see Figure 2) and use the class labels as the ground-truth clusters. Each dataset is then randomly split 30/70 for validation/clustering. We normalize the datasets so that all feature values lie in $[0, 1]$. On each validation set, we validate the parameters of Algorithm 1 and of the method of [12] only to yield cluster numbers close to the ground truth (to make a fair comparison with k-means). On each clustering set, we use the validated parameter settings for Algorithm 1 and the method of [12], and the ground-truth number of clusters for k-means, to perform the clustering. NMI is computed between the ground truth and the computed clusters, averaged over repeated runs, as shown in Table 1. As we can see, Algorithm 1 performs better than k-means on all datasets in terms of NMI. It is also better than the method of [12] on all datasets except hypothyroid. Note that the method of [12] is better than k-means on some datasets and worse on the others; such high-variance results on power-law datasets make us doubt that it really achieves power-law clustering. In Figure 2, we show the clusterings of the ecoli dataset produced by Algorithm 1, the method of [12], and k-means, using the whole dataset for clustering with validated parameters. It is clear that k-means produces more uniform clusters, and the method of [12] also splits the largest cluster in the dataset.
Table 2: NMI on UCI power-law graphs.

| Dataset | Ours | Normalized cuts |
|---|---|---|
| audiology | 0.662 | 0.561 |
| ecoli | 0.702 | 0.591 |
| glass | 0.432 | 0.356 |
| hypothyroid | 0.011 | 0.008 |
| pageblocks | 0.222 | 0.126 |
| flags | 0.357 | 0.200 |
Real-world power-law graph data sets. In this part we convert the UCI vector datasets from the preceding experiment into power-law graphs and perform power-law normalized cuts on them. We also run the normalized cuts algorithm on these graphs for comparison.
To obtain the graphs, we first form the adjacency matrix by applying a Gaussian similarity kernel to the vector data after normalizing the features to $[0, 1]$. We then use the adjacency matrix to form the kernel matrix and the weights as discussed in Section 4.2. We randomly split the data into validation/clustering sets with a 30/70 ratio. Parameters are selected on the validation set so that the cluster numbers are close to the ground truth; the number of clusters for normalized cuts is set to the true number of clusters, again for a fair comparison. Finally, we apply our power-law normalized cuts and standard normalized cuts on the clustering dataset. NMI averaged over 10 runs is shown in Table 2.
As we can see, our power-law normalized cuts outperforms normalized cuts on all the graphs in terms of NMI.
Image segmentation. Finally, we briefly demonstrate some qualitative results on image segmentation on the Berkeley segmentation data set [15]. We adopt an approach similar to that of [1] to compute the affinity matrix, and then run our power-law normalized cuts on it. We compare standard normalized cuts with our proposed method on graphs generated from input images. Figure 3 displays some example images; we see that normalized cuts tends to break up large segments more often than our approach.
6 Conclusion
We proposed a general framework of power-law graph cut algorithms that produce clusters whose sizes are power-law distributed and that does not fix the number of clusters upfront. The Pitman-Yor exchangeable partition probability function (EPPF) was incorporated into graph cut objectives as a regularizer to promote power-law cluster size distributions. A simple iterative algorithm was then proposed to locally optimize several such objectives. Our proposed algorithm can be viewed as performing MAP inference on a particular Pitman-Yor mixture model. Finally, we conducted experiments on various data sets and showed the effectiveness of our algorithms against competing baselines.
References
[1] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[2] Pak K. Chan, Martine D. F. Schlag, and Jason Y. Zien. Spectral k-way ratio-cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(9):1088–1096, 1994.
[3] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1944–1957, 2007.
[4] D. M. Greig, B. T. Porteous, and Allan H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B (Methodological), pages 271–279, 1989.
[5] Erik B. Sudderth and Michael I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In NIPS, pages 1585–1592, 2008.
[6] Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
[7] Nils Hjort, Chris Holmes, Peter Mueller, and Stephen Walker. Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, Cambridge, UK, 2010.
[8] Jim Pitman and Marc Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, pages 855–900, 1997.
[9] Jim Pitman. Combinatorial Stochastic Processes. Springer-Verlag, 2006. Lectures from the Saint-Flour Summer School on Probability Theory.
[10] Brian Kulis and Michael I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 513–520, 2012.
[11] Tamara Broderick, Brian Kulis, and Michael I. Jordan. MAD-Bayes: MAP-based asymptotic derivations from Bayes. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013.
[12] Xuhui Fan, Yiling Zeng, and Longbing Cao. Nonparametric power-law data clustering. CoRR, abs/1306.3003, 2013.
[13] Brian W. Kernighan and Shen Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2):291–307, 1970.
[14] Stella X. Yu and Jianbo Shi. Multiclass spectral clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 313–319, 2003.
[15] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.