Clustering and Community Detection with Imbalanced Clusters

08/26/2016 ∙ by Cem Aksoylar, et al.

Spectral clustering methods, which are frequently used in clustering and community detection applications, are sensitive to the specific graph construction, particularly when imbalanced clusters are present. We show that ratio cut (RCut) or normalized cut (NCut) objectives are not tailored to imbalanced cluster sizes since they tend to emphasize cut sizes over cut values. We propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on partitions to deal with imbalanced cluster sizes. Our approach parameterizes a family of graphs by adaptively modulating node degrees on a fixed node set, yielding a set of parameter-dependent cuts reflecting varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach and demonstrate the superiority of our method through experiments on synthetic and real datasets for data clustering, semi-supervised learning and community detection.

I Introduction

We consider graph partitioning problems with imbalanced partition sizes for two different graph modalities: similarity networks where we are given a measure of similarity for each pair of nodes (such as distance) and connectivity networks where we have a set of nodes and unweighted edges between pairs of nodes. The first modality is used frequently for graph-based spectral methods for clustering and semi-supervised learning (SSL) tasks. In this context, data with imbalanced clusters arises in many learning applications and has attracted much interest [1]. The second modality is the setup for community detection problems where identification of communities with small sizes has been considered in the literature [2].

In spectral methods for clustering and SSL, first a graph representing the data is constructed and then spectral clustering (SC) [3, 4] or SSL algorithms [5, 6] are applied on the resulting graph. Common graph construction methods include $\epsilon$-graphs, fully-connected RBF-weighted (full-RBF) graphs and $k$-nearest neighbor ($k$-NN) graphs. Of the three, the $k$-NN construction appears to be most popular due to its relative robustness to outliers [5, 7]. Recently [8] proposed $b$-matching graphs which claim to eliminate some of the spurious edges of $k$-NN graphs and lead to better performance. Model-based approaches that incorporate imbalancedness have previously been investigated [9]; however, they typically assume simple cluster shapes and need multiple restarts. In contrast, non-parametric graph-based approaches do not have this issue and are able to capture complex shapes [10].

Graph partitioning methods on connectivity networks based on spectral clustering are frequently used in the literature [11] for the problem of community detection [12]. One limitation of spectral methods is that they often fail to detect smaller community structures in dense networks [2]. While this limitation can be observed empirically, there also exist recent theoretical results for the stochastic block model [13, 14] that allow the difficulty of detection to be quantified relative to community sizes.

We also remark that community detection problems are not limited to the connectivity network setup and the similarity network modality is also useful when there exists additional information associated with the nodes or edges. One example of such a problem is a citation network, where each node represents an academic paper with text content. Similarity between papers can be quantified by extracting topics from the documents and using the topic distributions in two papers to compute a similarity score. This information can be used in addition to citation information (which by itself is a connectivity network) to discover communities, e.g. corresponding to research fields. These kinds of approaches combining the two modalities have been investigated in the literature [15, 16].

To the best of our knowledge, systematic ways of adapting spectral methods to imbalanced data do not exist. We show that the poor performance of spectral methods on imbalanced data can be attributed to applying ratio cut (RCut) or normalized cut (NCut) minimization objectives on traditional graphs, which sometimes tend to emphasize balanced partition size over small cut-values.

Our contributions:
To deal with imbalanced data we propose the partition constrained minimum cut problem (PCut). We remark that size-constrained min-cut problems appear to be computationally intractable [17, 18], thus instead we attempt to solve PCut on a parameterized family of cuts. To realize these cuts we parameterize a family of graphs over some parametric space and generate candidate cuts using spectral methods as a black-box. This requires a sufficiently rich graph parameterization capable of approximating varying degrees of imbalanced data. To this end we introduce a novel parameterization for graphs that involves adaptively modulating node degrees in varying proportions, for both similarity and connectivity networks. We then solve PCut on a baseline graph over the candidate cuts generated using this parameterization. Fig. 1 depicts our approach for binary clustering. Our limit cut analysis shows that our approach asymptotically does adapt to imbalanced and proximal clusters. We then demonstrate the superiority of our method through unsupervised clustering, semi-supervised learning and community detection experiments on synthetic and real datasets. Note that we do not presume imbalancedness in the underlying cluster sizes. Our method significantly outperforms traditional approaches when the underlying clusters are imbalanced, while remaining competitive when they are balanced. Our paper is based in part on preliminary results described in [19].

Fig. 1: Proposed framework for clustering on imbalanced data.

Related work:
Sensitivity of spectral methods to graph construction in similarity networks is well documented [7, 20, 21]. [22] suggests an adaptive RBF parameter in full-RBF graphs to deal with imbalanced clusters. [23] describes these drawbacks from a random walk perspective. [24, 25] also mention imbalanced clusters, but none of these works explicitly deal with imbalanced data. We remark that our approach is complementary to their schemes and can be used in conjunction. Another related approach is size-constrained clustering [26, 27, 28, 29, 30, 17], which is shown to be NP-hard. [31] proposes submodularity based schemes that work only for certain special cases. In addition, these works either impose exact cardinality constraints or upper bounds on the cluster sizes to look for balanced partitions. While this is related, we seek minimum cuts with lower bounds on smallest-sized clusters. Minimum cuts with lower bounds on cluster size naturally arise because we seek cuts at density valleys (accounted for by the min-cut objective) while rejecting singleton clusters and outliers (accounted for by the cluster size constraint). It is not hard to see that our problem is computationally no easier than min-cut with upper bounds on cluster sizes. (Footnote 1: In the 2-way partition setting, min-cut with lower bounds is equivalent to min-cut with upper bounds and is thus NP-hard. The multi-way partition problem generalizes the 2-way setting.)

A related area of recent active research is the detection of anomalous clusters in signals over networks [32, 33, 34]. This line of work focuses on the detection of well-connected subgraphs of given network data, which complements the approach considered here, which aims to partition the given graph into well-connected subgraphs.

The organization of the paper is as follows. In Sec. II we propose our partition constrained min-cut (PCut) framework, illustrate some of the fundamental issues underlying poor performance of spectral methods on imbalanced data and explain how PCut can deal with it. We describe the details of our PCut algorithm for both similarity and connectivity networks in Sec. III, and explore its theoretical basis for the similarity network framework in Sec. IV. In Sec. V we present experiments on synthetic and real datasets to show significant improvements in SC, SSL and community detection tasks. Sec. VI concludes the paper.

II Partition Constrained Min-Cut (PCut)

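We formalize the PCut problem for the similarity network and connectivity network modalities. Let $G = (V, E, W)$ be a weighted undirected graph with $n$ nodes, where the weights $w(u,v)$ are similarity measures between two nodes $u$ and $v$, and are equal to 1 uniformly for connectivity graphs. We denote by $(C, \bar{C})$ a cut that partitions $V$ into $C$ and $\bar{C} = V \setminus C$. The cut-value associated with $(C, \bar{C})$ is:

$$\mathrm{Cut}(C, \bar{C}) = \sum_{u \in C,\ v \in \bar{C},\ (u,v) \in E} w(u,v) \qquad (1)$$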
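
As an illustration of the cut-value definition, here is a minimal sketch that assumes the graph is stored as a dense symmetric weight matrix and the cut as a boolean membership mask. The names `cut_value`, `W` and `in_C` are illustrative and do not come from the paper.

```python
import numpy as np

def cut_value(W, in_C):
    """Sum of edge weights crossing the partition (C, complement of C)."""
    W = np.asarray(W, dtype=float)
    in_C = np.asarray(in_C, dtype=bool)
    # Rows index nodes inside C, columns index nodes outside C.
    return float(W[np.ix_(in_C, ~in_C)].sum())

# Toy connectivity graph: the path 0-1-2-3 with unit weights.
W = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

print(cut_value(W, [True, True, False, False]))  # only edge (1, 2) crosses
```

For a connectivity network all weights are 1, so the cut value is simply the number of crossing edges.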

For similarity graphs the edge set $E$ may be generated from the similarity weights $w(u,v)$ by constructions such as the $k$-NN graph, where each node is connected to its $k$ closest (i.e. most similar) neighbors; the $\epsilon$-graph, where each node is connected to all other nodes whose similarity value exceeds $\epsilon$; or a full-RBF graph with edges between all node pairs. We refer to these methods for edge set generation as graph construction methods.

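We pose the problem of partition size constrained minimum cut (PCut) as:

$$C_0 = \arg\min_{C \subset V} \left\{ \mathrm{Cut}(C, \bar{C}) \;:\; \min\{|C|, |\bar{C}|\} \geq \delta \right\} \qquad (2)$$

where $\delta$ is the minimum allowed partition size, and where we also denote the partition of nodes corresponding to the optimal cut with $(C_0, \bar{C}_0)$.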

While our proposed method does not necessitate features associated with nodes in similarity networks and requires only that similarity scores exist for each pair of nodes, for analysis purposes we consider a generative model where we assume that each node $u$ has features $x_u \in \mathbb{R}^d$ that are drawn from some unknown density $f$. In this generative framework, PCut corresponds to searching for a hypersurface $S$ that partitions $\mathbb{R}^d$ into two subsets $D$ and $\bar{D}$ (with $D \cup \bar{D} = \mathbb{R}^d$) with non-trivial mass and passes through low-density regions.

We remark that while such a formulation is natural for learning problems such as clustering or SSL, such representations arise in many other problems where we have additional information associated with nodes or edges. One example is a collaborative filtering problem for recommender systems with $n$ users, where user ratings for movies can be represented with a linear factor model in which each user and each movie is represented by a latent feature vector. Probabilistic matrix factorization methods such as [35] then induce a probability distribution over the users’ latent features, which are independent and identically distributed. Another example is citation networks, where each node represents an academic paper with text content. In this case each node can be represented by a distribution over topics using topic modeling, which are sampled from induced probability spaces in generative methods such as latent Dirichlet allocation [36].

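Throughout the paper, we refer to $y = \min\{n_1, n_2\}/n$ as the imbalance parameter for a graph with two underlying clusters with sizes $n_1$ and $n_2$. Note that for the generative model this corresponds to the probability mass of the smaller of the two subsets $D_0$ and $\bar{D}_0$, where $(D_0, \bar{D}_0)$ is the optimal size-constrained partitioning.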

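Eq. (2) describes a binary partitioning problem but generalizes to an arbitrary number of partitions $K$, for which we state PCut as:

$$(C_1^0, \dots, C_K^0) = \arg\min_{C_1 \cup \dots \cup C_K = V} \left\{ \sum_{j=1}^{K} \mathrm{Cut}(C_j, \bar{C}_j) \;:\; \min_{j} |C_j| \geq \delta \right\}$$

with $C_1, \dots, C_K$ disjoint and $\delta$ the minimum allowed cluster size.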

While we consider the binary problem with the hypersurface interpretation of cuts for analyzing limit-cut behavior, we present our algorithm in Sec. III for the generalized problem and utilize it in experiments in Sec. V.

Note that without size constraints the problem in Eq. (2) is identical to the min-cut criterion [37], which is well-known to be sensitive to outliers. This objective is closely related to the problem of graph partitioning with size constraints, with various versions known to be NP-hard [18]. Approximations to such partitioning problems have been developed [28] but appear to be overly conservative. More importantly these papers [28, 29, 30] either focus on balanced partitions or cuts with exact size constraints. In contrast our objective here is to identify natural low-density cuts that are not too small (i.e. with lower bounds on the smallest sized cluster). We employ SC as a black-box to generate candidate cuts on a suitably parameterized family of graphs. Eq. (2) is then optimized over these candidate cuts.

II-A RCut, NCut and PCut

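The well-known spectral clustering algorithms attempt to minimize RCut or NCut:

$$\min_{C \subset V}\ \mathrm{Cut}(C, \bar{C}) \left( \frac{1}{\mathrm{vol}(C)} + \frac{1}{\mathrm{vol}(\bar{C})} \right) \qquad (3)$$

where $\mathrm{vol}(C) = |C|$ for RCut and $\mathrm{vol}(C) = \sum_{u \in C,\, v \in V} w(u,v)$ for NCut, and for simplicity we consider the binary problem. Both objectives seek to trade off low cut-values against cut size. While robust to outliers, minimizing RCut/NCut can lead to poor performance when data is imbalanced (i.e. with small $y$). To see this, we define the cut-ratio $q$ and the imbalance coefficient $y$ for some graph $G$:

$$q = \frac{\mathrm{Cut}(C_0, \bar{C}_0)}{\mathrm{Cut}(C_B, \bar{C}_B)}, \qquad y = \frac{\min\{|C_0|, |\bar{C}_0|\}}{|V|}$$

where $(C_0, \bar{C}_0)$ corresponds to the optimal PCut and $(C_B, \bar{C}_B)$ is any balanced partition with $|C_B| = |\bar{C}_B| = |V|/2$.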

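Remark II.1

For any balanced partition $(C_B, \bar{C}_B)$ with $|C_B| = |\bar{C}_B|$, with $q$ the cut-ratio and $y$ the imbalance coefficient, $\mathrm{RCut}(C_0, \bar{C}_0) > \mathrm{RCut}(C_B, \bar{C}_B)$ if and only if $q > 4y(1-y)$.

A similar expression holds for NCut with appropriate modifications. Because $q$ varies for different graphs but $y$ does not, the ratio $\mathrm{RCut}(C_0, \bar{C}_0) / \mathrm{RCut}(C_B, \bar{C}_B) = q / (4y(1-y))$ depends on the underlying graph in a connectivity network and the specific graph construction in a similarity network. So it is plausible that for some graphs the RCut/NCut value satisfies $\mathrm{RCut}(C_0, \bar{C}_0) > \mathrm{RCut}(C_B, \bar{C}_B)$ while for others $\mathrm{RCut}(C_0, \bar{C}_0) < \mathrm{RCut}(C_B, \bar{C}_B)$. In the former case RCut/NCut will favor a balanced cut over the ground-truth cut and vice versa if the latter is true. Fig. 2 illustrates this behavior.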

Fig. 2: Cut-ratio ($q$) vs. imbalance ($y$). The RCut value is smaller for balanced cuts than for imbalanced optimal cuts for cut-ratios above the curve.
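
The comparison behind Remark II.1 can be checked numerically. The sketch below assumes the standard ratio cut form, cut value times (1/|C| + 1/|C̄|), under which the imbalanced cut has the larger RCut value exactly when the cut-ratio exceeds 4y(1 − y). The function names and the numbers are illustrative and not taken from the paper.

```python
def rcut(cut, size_a, size_b):
    """Ratio cut of a binary partition with the given cut value and side sizes."""
    return cut * (1.0 / size_a + 1.0 / size_b)

def balanced_cut_wins(q, y):
    """True when RCut prefers a balanced cut over the imbalanced one."""
    return q > 4.0 * y * (1.0 - y)

# 100 nodes, smaller cluster holds 10% of them, balanced cut has value 10.
n, y, cut_balanced = 100, 0.1, 10.0
for q in (0.2, 0.5):
    cut_imbalanced = q * cut_balanced
    r_imb = rcut(cut_imbalanced, y * n, (1 - y) * n)
    r_bal = rcut(cut_balanced, n / 2, n / 2)
    # The direct comparison agrees with the closed-form threshold 4y(1-y) = 0.36.
    print(q, r_imb > r_bal, balanced_cut_wins(q, y))
```

With these numbers a cut-ratio of 0.2 keeps the imbalanced cut optimal for RCut, while a cut-ratio of 0.5 lies above the threshold and makes the balanced cut preferable even though its cut value is twice as large.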

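We analyze the limit-cut behavior of $k$-NN, $\epsilon$-graphs and RBF graphs to build intuition for the similarity network case. For properly chosen $k$, $\epsilon$ and $\sigma$ [20, 38], as sample size $n \to \infty$, we have

$$q \to \frac{\int_{S_0} f^{\gamma}(x)\,dx}{\int_{S_B} f^{\gamma}(x)\,dx}, \qquad y \to \min\{\mu(D_0), \mu(\bar{D}_0)\} \qquad (4)$$

where $S_0$ and $S_B$ are the hypersurfaces corresponding to the optimal and balanced cuts, $\mu$ is the probability measure of $f$, and $\gamma < 1$ for $k$-NN and $\gamma = 2$ for $\epsilon$-graph and full-RBF graphs. For the similarity network, we can then make the following remarks for the asymptotic scenario: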

  1. While the cut-ratio $q$ varies with graph construction, the imbalance coefficient $y$ is invariant. In particular we expect $q$ for $k$-NN to be larger relative to $q$ for full-RBF and $\epsilon$-graphs since $\gamma < 1$.

  2. We expect PCut to output similar results for all graph constructions. This follows from the limit-cut behavior and the limiting independence of $y$ from graph construction.

  3. We can loosely say that if data is imbalanced with sufficiently proximal clusters, then asymptotically $k$-NN, full-RBF and $\epsilon$-graphs can all fail when RCut is minimized. To see this consider an imbalanced mixture of two Gaussians. By suitably choosing the means and variances we can construct sufficiently proximal clusters with the same imbalance but relatively large $q$ values. This is because the density $f$ will be relatively large even at density valleys for proximal clusters. Our statement then follows from Remark II.1.

Similar to the similarity network, we can build intuition for the performance of SC in the case of imbalanced clusters using the stochastic block model (SBM) as the generative process for a connectivity network. The stochastic block model is widely used for community detection to parameterize the problem using edge existence probabilities within and between communities. Consider a network with two communities of sizes , where for this case we have w.l.o.g. and assume subnetworks for the communities are generated by the Erdős-Rényi graph with internal edge existence probabilities , and edge probability between two nodes in different subnetworks.

Using the phase transition analysis proposed by [13], we observe that an asymptotic lower bound to (i.e. if SC recovers clusters almost surely) is while an asymptotic upper bound (i.e. if SC fails to recover clusters almost surely) is given by . The gap between the two bounds widens as approaches zero and the lower bound shrinks linearly with for most scaling regimes of interest, e.g. when . This result implies that the performance of spectral clustering based community detection is more uncertain and tends to be worse when trying to detect small communities.
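
To make the generative model concrete, the following is a minimal sketch of sampling a two-community stochastic block model with imbalanced sizes. The function name, the argument names (`p1`, `p2`, `p_between`) and the seed handling are assumptions made for illustration and are not notation from the paper.

```python
import numpy as np

def sample_two_block_sbm(n1, n2, p1, p2, p_between, seed=0):
    """Sample a symmetric 0/1 adjacency matrix with two planted communities."""
    rng = np.random.default_rng(seed)
    n = n1 + n2
    labels = np.r_[np.zeros(n1, dtype=int), np.ones(n2, dtype=int)]
    # Edge probability for every node pair, by block membership.
    P = np.full((n, n), float(p_between))
    P[:n1, :n1] = p1
    P[n1:, n1:] = p2
    # Draw each unordered pair once (strict upper triangle), then mirror.
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = (upper | upper.T).astype(int)
    return A, labels

# A small community of 20 nodes next to a large one of 180 nodes.
A, labels = sample_two_block_sbm(20, 180, p1=0.5, p2=0.5, p_between=0.05)
print(A.shape, labels.sum())
```

Shrinking `n1` relative to `n2` while keeping the probabilities fixed gives a simple way to reproduce the small-community regime discussed above.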

In summary, for similarity networks we have learned that the optimal RCut/NCut depends on graph construction and can fail for imbalanced proximal clusters for $k$-NN, $\epsilon$-graph and full-RBF constructions on the same data. PCut is computationally intractable but asymptotically invariant to graph construction and picks the right solution. Since SC is a relaxed variant of optimal RCut/NCut, we can expect it to have similar behavior to the optimal RCut/NCut. Similarly, we argued that the SC performance on a given graph is expected to deteriorate for small clusters in connectivity networks. This motivates the following section.

II-B Using spectral clustering for PCut

For the data clustering/SSL problem, the discussion in Sec. II-A suggests the possibility of controlling the cut-ratio $q$ through modification of the underlying graph parameters while not impacting $y$ (which is invariant to different graph constructions). This key insight leads us to the following framework for solving PCut:

  • (A) Parameter optimization: Generate several candidate optimal RCuts/NCuts as a function of graph parameters. Pose PCut over these candidate cuts rather than arbitrary cuts as in Eq. (2). Thus PCut is now parameterized over graph parameters.

  • (B) Graph parameterization: If the graph parameters are not sufficiently rich to allow for adaptation to imbalanced or proximal cuts, (A) would be useless. Therefore, we want graph parameterizations that allow sufficient flexibility such that the posed optimization problem is successful for a broad range of imbalanced and proximal data.

We first consider the second objective. For the similarity network, we have found in our experiments (cf. Sec. V) that the parameterization based on RBF $k$-NN graphs is not sufficiently rich to account for varying levels of imbalanced and proximal data. To induce even more flexibility we introduce a new parameterization that we also generalize to connectivity networks.

Rank modulated degree (RMD) graphs:
We introduce RMD graphs, a richer parameterization of graphs that allows for more control over $q$ and offers sufficient flexibility in dealing with a wide range of imbalanced and proximal data. Our framework adaptively modulates the node degrees on the given baseline graph, selectively removing edges in low-density regions and adding edges in high-density regions. This modulation scheme is based on a novel ranking scheme for data samples introduced in Sec. III, which reflects the relative density and allows the identification of high/low density nodes. For similarity networks, we consider $k$-NN graphs since it is easier to ensure graph connectivity compared to $\epsilon$-graphs.

For connectivity networks, we adopt a similar scheme that selectively removes edges from the given graph to improve SC performance. We again aim to remove edges between the clusters (“low density” regions), while keeping edges that are present inside the clusters (“high density” regions). Since we do not have similarity measures between nodes, we use other metrics to choose how many and which edges to remove for a node.

We are now left to pose PCut over graph parameters or candidate cuts, which we describe in detail in the following section. For similarity networks we construct a universal baseline graph for the purpose of comparison among different cuts and to pick the cut that solves Eq. (2). These different cuts are obtained by means of SC and are parameterized by graph construction parameters. PCut is then solved on the baseline graph over candidate cuts realized from SC. The optimization framework is illustrated in Fig. 1 in the context of data clustering.

III Our Algorithm

III-A Similarity networks

Given $n$ data samples, our task is unsupervised clustering or SSL, assuming the number of clusters/classes $K$ is known. We start with a baseline $k_0$-NN graph $G_0$ built on these samples, with $k_0$ large enough to ensure graph connectivity. The main steps of our PCut framework are as follows.

Main Algorithm:  RMD Graph-based PCut
1. Compute the rank $R(x_u)$ of each sample $x_u$;
2. For different configurations of parameters $(\lambda, k, \sigma)$,
    a. Construct the parametric RMD graph;
    b. Apply spectral methods to obtain a $K$-partition on the current RMD graph;
3. Among various partition results, pick the “best” (evaluated on baseline $G_0$).

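(1) Rank computation:
We compute the rank $R(x_u)$ of every node $u$ as follows:

$$R(x_u) = \frac{1}{n} \sum_{w=1}^{n} \mathbb{I}\{\eta(x_u) \leq \eta(x_w)\} \qquad (5)$$

where $\mathbb{I}\{\cdot\}$ denotes the indicator function and $\eta(x_u)$ is a statistic reflecting the relative density at node $u$. Since the density $f$ is unknown, we choose the average nearest neighbor distance as a surrogate for $\eta$. To this end let $N_u$ be the set of all neighbors of node $u$ on the baseline graph, and we let

$$\eta(x_u) = \frac{1}{|N_u|} \sum_{w \in N_u} \|x_u - x_w\| \qquad (6)$$

The ranks $R(x_u) \in [0, 1]$ are relative orderings of samples and are uniformly distributed. $R(x_u)$ indicates whether a node lies near density valleys or high-density areas, as illustrated in Fig. 3.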

Fig. 3: Density level sets & rank estimates computed from samples for a mixture of two Gaussians.
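
The rank computation can be sketched as follows. This is a minimal illustration that assumes Euclidean features, takes the average distance to the `k0` nearest neighbors as the density surrogate (a small average distance means high density), and defines the rank of a node as the fraction of nodes whose surrogate is at least as large. The function names and the brute-force distance matrix are choices made here for brevity, not the authors' implementation.

```python
import numpy as np

def knn_distances(X, k):
    """Pairwise distances and the indices of each point's k nearest neighbors."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    idx = np.argsort(D, axis=1)[:, :k]
    return D, idx

def density_ranks(X, k0):
    """Rank in (0, 1]: high for points in dense regions, low near valleys/outliers."""
    D, idx = knn_distances(X, k0)
    eta = np.take_along_axis(D, idx, axis=1).mean(axis=1)  # average k0-NN distance
    # Fraction of nodes w with eta(u) <= eta(w).
    return (eta[:, None] <= eta[None, :]).mean(axis=1)

# Four evenly spaced points and one far-away outlier.
X = np.array([[0.0], [1.0], [2.0], [3.0], [50.0]])
R = density_ranks(X, k0=2)
print(R)
```

In this toy example the outlier receives the lowest rank and the two interior points of the dense group receive the highest.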

(2a) Parameterized graph construction:
We consider three parameters: $\lambda \in [0, 1]$, $k$ for $k$-NN and $\sigma$ for RBF similarity. These are then suitably discretized. We generate a weighted graph $G(\lambda, k, \sigma)$ on the same node set as the baseline graph but with different edge sets. For each node $u$ we construct edges to its $k_u$ nearest neighbors, with $k_u$ given by

$$k_u = k \left( \lambda + 2(1 - \lambda) R(x_u) \right) \qquad (7)$$

which results in the RMD parameterization through different $\lambda$. Note that $\lambda = 1$ corresponds to no degree modulation. For non-RMD parameterizations that we compare to (such as RBF $k$-NN) we vary $k$ and $\sigma$.
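
A minimal sketch of the degree modulation step, under several assumptions that are not stated in the text: the modulated degree is taken as k(λ + 2(1 − λ)R), which reduces to plain k-NN at λ = 1 and keeps the average degree near k when ranks are uniform on [0, 1]; degrees are rounded to the nearest integer and kept at least 1; edge weights use the RBF kernel exp(−d²/(2σ²)); and the directed neighbor lists are symmetrized by keeping an edge if either endpoint selects it.

```python
import numpy as np

def rmd_degrees(R, k, lam):
    """Per-node neighbor counts: lam = 1 gives k everywhere, lam < 1 follows the rank."""
    k_u = np.rint(k * (lam + 2.0 * (1.0 - lam) * np.asarray(R, dtype=float)))
    return np.maximum(k_u, 1).astype(int)   # rounding and the floor of 1 are assumptions

def rmd_graph(X, R, k, lam, sigma):
    """RBF-weighted graph where node u connects to its k_u nearest neighbors."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    order = np.argsort(D, axis=1)
    k_u = np.minimum(rmd_degrees(R, k, lam), n - 1)
    W = np.zeros((n, n))
    for u in range(n):
        nbrs = order[u, :k_u[u]]
        W[u, nbrs] = np.exp(-D[u, nbrs] ** 2 / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)               # symmetrize: keep an edge chosen by either end

print(rmd_degrees([0.0, 0.5, 1.0], k=10, lam=0.5))
```

Low-rank nodes (near density valleys) end up with fewer edges than `k` and high-rank nodes with more, which is the sparsification effect the parameterization is meant to produce.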

(2b) Obtaining cuts on the parameterized graphs:
From $G(\lambda, k, \sigma)$ we generate a family of $K$-partitions $C_1(\lambda, k, \sigma), \dots, C_K(\lambda, k, \sigma)$. These cuts are generated based on the eventual learning objective. For instance, if $K$-clustering is the eventual goal these $K$-cuts are generated using SC. For SSL we use RCut-based Gaussian random fields (GRF) and NCut-based graph transduction via alternating minimization (GTAM) to generate cuts. These algorithms all involve minimizing RCut/NCut as the main objective (SC) or some smoothness regularizer (GRF, GTAM). For details about these algorithms readers are referred to [5, 6, 7, 39].
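
For the binary case, a candidate cut can be produced with a very small spectral routine. The sketch below is a simplified stand-in for SC: it uses the unnormalized graph Laplacian (the RCut relaxation) and thresholds the second eigenvector at zero instead of running k-means on several eigenvectors. It assumes a connected graph given as a dense symmetric weight matrix; the function name is illustrative.

```python
import numpy as np

def spectral_bipartition(W):
    """Two-way partition from the sign of the Fiedler vector of L = D - W."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Two triangles joined by a single weak edge.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1

print(spectral_bipartition(W))
```

Which triangle receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful.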

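(3) Parameter optimization:
The final step is to solve Eq. (2) on the baseline graph $G_0$. We assume prior knowledge that the smallest cluster is at least of size $\delta$. The $K$-partitions obtained from step (2b) are parameterized as $C_1(\lambda, k, \sigma), \dots, C_K(\lambda, k, \sigma)$. We optimize over the parameters to obtain the minimum cut partition on $G_0$:

$$\min_{\lambda, k, \sigma} \left\{ \mathrm{Cut}_0\big(C_1(\lambda, k, \sigma), \dots, C_K(\lambda, k, \sigma)\big) \;:\; \min_{j} |C_j(\lambda, k, \sigma)| \geq \delta \right\} \qquad (8)$$

where $\mathrm{Cut}_0(\cdot)$ denotes evaluating cut values on the baseline graph $G_0$. Partitions with clusters smaller than $\delta$ are discarded.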
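
The selection step can be sketched as a simple filter-then-minimize loop. The sketch assumes that candidate partitions are given as integer label vectors keyed by their (hypothetical) graph parameters, that the multi-way cut value is the total baseline-graph weight between differently labeled nodes, and that the size threshold `delta` is an absolute node count of at least 1. None of these names come from the paper.

```python
import numpy as np

def multiway_cut(W0, labels):
    """Total weight of baseline-graph edges whose endpoints have different labels."""
    labels = np.asarray(labels)
    crossing = labels[:, None] != labels[None, :]
    return float(W0[crossing].sum() / 2.0)   # symmetric W0 counts each edge twice

def select_partition(W0, candidates, K, delta):
    """Return (params, labels, cut) of the feasible candidate with the smallest cut."""
    best = (None, None, np.inf)
    for params, labels in candidates.items():
        labels = np.asarray(labels)
        sizes = np.bincount(labels, minlength=K)
        if sizes.min() < delta:              # discard partitions with a too-small cluster
            continue
        value = multiway_cut(W0, labels)
        if value < best[2]:
            best = (params, labels, value)
    return best

# Baseline graph: two triangles, a unit bridge 2-3, and a weakly attached node 6.
W0 = np.zeros((7, 7))
for u, v, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1),
                (4, 5, 1), (2, 3, 1), (5, 6, 0.5)]:
    W0[u, v] = W0[v, u] = w

candidates = {
    (1.0, 5): [0, 0, 0, 1, 1, 1, 1],   # cuts the bridge
    (0.4, 5): [0, 0, 0, 0, 0, 0, 1],   # isolates the weakly attached node
    (0.0, 5): [0, 1, 0, 1, 0, 1, 0],   # a poor partition
}
params, labels, value = select_partition(W0, candidates, K=2, delta=2)
print(params, value)
```

Without the size constraint the singleton cut would win because its cut value is smaller, which is the outlier sensitivity of plain min-cut that the constraint is meant to remove.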

Remarks:

  1. While step (3) suggests a grid search over several parameters, it turns out that other parameters such as $k$ and $\sigma$ do not play as important a role as $\lambda$. Indeed, the experiments in Sec. V show that while step (3) can select appropriate $k$ and $\sigma$, it is searching over $\lambda$ that adapts spectral methods to data with varying levels of imbalancedness (cf. Table II, RBF $k$-NN vs. RBF RMD).

  2. Our framework uses existing spectral algorithms and thus can be combined with other graph-based partitioning algorithms to improve performance for imbalanced data, such as 1-spectral clustering, sparsest cut or minimizing conductance [24, 40, 41, 42]. We utilize SC for data clustering and GRF/GTAM algorithms for the SSL problem in our experiments in Sec. V, with the same RMD graph parameterization framework.

III-B Connectivity networks

We adapt the rank computation framework and the degree modulation scheme from similarity networks to the connectivity networks case. Since we do not have access to similarity scores between nodes (such as the distances available in similarity networks), it is not possible to directly reuse the computation of the score function. Instead, we adopt the count of common neighbors as a similarity indicator between two nodes. This statistic is defined by $c(u, v) = |N_u \cap N_v|$, where $N_u$ denotes the set of neighbors of a node $u$, and is used frequently as a heuristic measure of similarity in applications such as link prediction [43]. One interesting application of the statistic is community detection without spectral clustering, where [44] considers the scenario with exactly balanced community sizes and aims to discover clusters directly using the statistic. In contrast, we focus on clusters with imbalanced sizes and use the statistic only as a similarity measure to construct an analogy to the similarity network case. The intuition about the count of common neighbors statistic is that two nodes that are in the same cluster share more neighbors (which are mostly from the same cluster) than two nodes from different clusters. Thus it can be used as a measure to determine whether or not two nodes belong to the same cluster.

To extend the RMD framework to the connectivity network, we again compute the rank of a node from a relative “density” function $\eta(u)$, which for this case we define as

$$\eta(u) = -\frac{1}{|N_u|} \sum_{v \in N_u} c(u, v)$$

where $N_u$ is the set of neighbors of $u$ and we essentially replaced the Euclidean distance with the negative similarity $-c(u, v)$, the negated count of common neighbors. Then using this relative density measure, the rank is computed as in Eq. (5). In this context, the nodes with high rank are “high density” nodes, which are connected more frequently to their own clusters than to different clusters compared to the average node in the graph. On the other hand, “low density” nodes with low rank are connected more to other clusters compared to the average case.

Given the rank $R(u)$ of a node $u$ with degree $d_u$, we modulate the node’s degree using the formula

$$d'_u = d_u \left( \lambda + (1 - \lambda) R(u) \right)$$

where we differ from Eq. (7) by not multiplying the $(1 - \lambda) R(u)$ term by 2. This is because we only decrease the degrees of the nodes by removing edges, rather than increasing or decreasing according to rank. One remaining issue is which edges to remove from a node given that its new degree $d'_u$ is less than its original degree $d_u$. Considering the analogy to the $k$-NN graph, we remove the edges that are connected to neighbors farthest from $u$, i.e. for which the count of common neighbors is smallest. This procedure prioritizes the removal of edges to neighbors in other clusters before the edges that connect to neighbors in the same cluster. In addition, more edges are removed from nodes with lower rank, i.e. nodes which connect to nodes in other clusters more frequently than the average. Similarly, fewer edges are removed from nodes with higher rank, i.e. nodes that do not connect to other clusters as frequently.
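
The edge-removal procedure for connectivity networks can be sketched as below. The sketch makes several assumptions beyond the text: the adjacency matrix is dense, symmetric and 0/1 with a zero diagonal; the density surrogate is the negated mean common-neighbor count over a node's neighbors; the new degree is d(λ + (1 − λ)R) rounded up; and an edge survives if either endpoint keeps it. The function name is illustrative.

```python
import numpy as np

def modulate_connectivity_graph(A, lam):
    """Remove edges with few common neighbors, more aggressively at low-rank nodes."""
    A = np.asarray(A, dtype=int)
    n = A.shape[0]
    common = A @ A                          # common[u, v] = number of shared neighbors
    deg = A.sum(axis=1)
    # Negated mean common-neighbor count over each node's neighbors.
    eta = -(A * common).sum(axis=1) / np.maximum(deg, 1)
    R = (eta[:, None] <= eta[None, :]).mean(axis=1)
    new_deg = np.ceil(deg * (lam + (1.0 - lam) * R)).astype(int)  # rounding up is an assumption
    keep = np.zeros((n, n), dtype=bool)
    for u in range(n):
        nbrs = np.flatnonzero(A[u])
        # Keep the neighbors sharing the most common neighbors with u.
        ranked = nbrs[np.argsort(-common[u, nbrs], kind="stable")]
        keep[u, ranked[:new_deg[u]]] = True
    return (keep | keep.T).astype(int)      # an edge survives if either endpoint keeps it

# Two 4-cliques joined by the single edge (3, 4).
A = np.zeros((8, 8), dtype=int)
for block in (range(0, 4), range(4, 8)):
    for u in block:
        for v in block:
            if u != v:
                A[u, v] = 1
A[3, 4] = A[4, 3] = 1

B = modulate_connectivity_graph(A, lam=0.5)
print(A.sum() // 2, B.sum() // 2)   # edge counts before and after
```

In this toy graph the only edge with no common neighbors is the bridge between the cliques, and it is the one removed; with `lam=1.0` the graph is returned unchanged.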

The parameterization and parameter optimization in parts (2b) and (3) follow as in the similarity network case, where the only search parameter we use is $\lambda$ and not other parameters such as $k$ or $\sigma$. We note that it would be possible to determine the new degree of a node in a more robust manner given parameters such as the cluster size imbalance and the edge probabilities of the stochastic block model considered in Sec. II-A; however, we do not assume knowledge of these parameters and instead use the rank of a node and the parameterization over $\lambda$ to account for their uncertainty.

IV Analysis of RMD for Similarity Networks

We now present an asymptotic analysis for binary cuts in similarity networks that shows how RMD helps control the cut-ratio $q$, introduced in Sec. II. We remark that since we analyze the limit cut behavior of RCut/NCut, which is directly related to SC, it may be possible to extend the analysis to other methods such as GTAM for SSL that are based on the NCut objective.

Assume the dataset $\{x_1, \dots, x_n\}$ is drawn i.i.d. from an underlying density $f$ in $\mathbb{R}^d$. Let $G = (V, E)$ be the unweighted RMD graph. Given a separating hyperplane $S$, denote with $C$ and $\bar{C}$ the two subsets of $V$ split by $S$, and let $\eta_d$ denote the volume of the unit ball in $\mathbb{R}^d$. Assume the density $f$ satisfies the regularity conditions stated below.

Regularity conditions: has a compact support, and is continuous and bounded: . It is smooth, i.e. , where is the gradient of at . There are no flat density regions, i.e.  for all in the support and , where is a constant.

First we show the asymptotic consistency of the rank $R(x)$ at some point $x$. The limit of $R(x)$ is $p(x)$, which is defined as the complement of the volume of the level set containing $x$. Note that $p(x)$ exactly follows the shape of $f$ and is always in $[0, 1]$ no matter how $f$ scales.

Theorem IV.1

Assume $f$ satisfies the above regularity conditions. As $n \to \infty$, we have

$$R(x) \to p(x) := \int_{\{y :\, f(y) \leq f(x)\}} f(y)\,dy \qquad (9)$$

This theorem implies that the rank $R(x)$ of a point is a good estimate of $p(x)$, which is in turn related to the shape of the density $f$. Thus $R(x)$ is a useful metric for identifying high and low density points, which is necessary for modulating node degrees to emphasize density valleys in the RMD framework.

The proof involves the following two steps:

  • The expectation of the empirical rank is shown to converge to $p(x)$ as $n \to \infty$.

  • The empirical rank is shown to concentrate at its expectation as $n \to \infty$.

Details can be found in the appendix. Small/large $R(x)$ values correspond to low/high density respectively. $R(x)$ asymptotically converges to an integral expression, so it is smooth (Fig. 3). Also $R(x)$ is uniformly distributed in $[0, 1]$, which makes it appropriate for modulating the degrees with control of the minimum, maximum and average degrees.

Next we study RCut and NCut values induced on the unweighted RMD graph. We assume for simplicity that each node $u$ is connected to exactly $k_u$ nearest neighbors, with $k_u$ given by Eq. (7). The limit cut expression on the RMD graph involves an additional adjustable term which varies point-wise according to the density.

Theorem IV.2

Assume satisfies the above regularity conditions and also the general assumptions in [20]. Let be a fixed hyperplane in . For an unweighted RMD graph, set the degrees of points according to Eq. (7), where is a constant. Let . Assume . Assume if and assume if . Then as we have that

(10)
(11)

where , and .

The proof shows the convergence of the cut term and balancing term respectively:

(12)
(13)

The analysis is an extension of [20] and the proof is provided in the appendix.

Theorem IV.2 shows that the RMD parameterization can affect the RCut/NCut behavior in a meaningful manner as the densities are modulated with the term that varies with parameter . We discuss the effects of this modulation on imbalanced data next.

Imbalanced data & RMD graphs:
In the limit cut behavior, without our term, the balancing term could induce a larger RCut/NCut value for the density valley cut than the balanced cut when the underlying data is imbalanced, i.e.  is small. Applying our parameterization scheme inserts an additional term in the limit-cut expressions. is monotonic in the -value and thus allows the cut-value to be further reduced/increased at low/high density regions. Indeed for small values, cuts near peak densities have and so , while near valleys we have . This has a direct bearing on the cut-ratio since small can reduce the cut-ratio for a given (see Fig. 2) and leads to better control of cuts on imbalanced data. In summary, this analysis shows that RMD graphs used in conjunction with the optimization framework of Fig. 1 can adapt to varying levels of imbalanced data.

V Experiments

Experiments in this section involve both synthetic and real datasets, where we consider data clustering and semi-supervised learning with similarity networks for the first two subsections and consider community detection with connectivity networks in the third subsection. For the similarity network problems, we focus on imbalanced data by randomly sampling from different classes disproportionately. We compare the RMD graph with full-RBF, $\epsilon$-graph, RBF $k$-NN, $b$-matching graph [21] and full graph with adaptive RBF (full-aRBF) [22]. We view each as a parametric family of graphs parameterized by their relevant parameters and optimize over different parameters as described in Sec. III and Eq. (8). For RMD graphs we additionally optimize over $\lambda$. Error rates are averaged over 20 trials.

For clustering experiments we apply both RCut and NCut, but focus mainly on NCut for brevity as NCut is generally known to perform better. We report performance by evaluating how well the cluster structures match the ground truth labels, as is the standard criterion for partitional clustering [45]. For instance consider Table 1 where error rates for USPS symbols 1, 8, 3, 9 are tabulated. We follow our procedure outlined in Sec. III and find the optimal partition that minimizes Eq. (8) agnostic to the correspondence between samples and symbols. Errors are then reported by looking at mis-associations.

For SSL experiments we randomly pick labeled points among imbalanced sampled data, guaranteeing at least one labeled point from each class. SSL algorithms such as RCut-based GRF and NCut-based GTAM are applied on parameterized graphs built from partially labeled data and generate various partitions. Again we follow our procedure outlined in Sec. III and find the optimal partition that minimizes Eq. (8) agnostic to ground truth labels. Then labels for unlabeled data are predicted based on the selected partition and compared against the unknown true labels to produce the error rates.

Time complexity: RMD graph construction has time complexity (similar to the -NN graph). Computing cut value and checking cluster size for a partition takes . So if graphs are parameterized in total and the complexity of the learning algorithm is , the overall time complexity is .

Tuning parameters: Note that parameters including that characterize the graphs are variables to be optimized in Eq. (8). The remaining parameters are (a) in the baseline graph which is fixed to be , (b) imbalanced size threshold which is a priori fixed to be about , i.e., 5% of all samples.

Evaluation against oracle: To evaluate the effectiveness of our framework and the RMD parameterization, we compare against an oracle that is tuned to both ground truth labels as well as imbalance proportions.

V-A Synthetic illustrative data clustering example

(a) -NN (b) -matching (c) -graph (full-RBF) (d) RMD
Fig. 4: Clustering results of 3-partition SC on the two crescents and one Gaussian dataset. SC on -graph completely fails due to the outlier. For -NN and -matching graphs SC cannot recognize the long winding low-density regions between the two crescents and fails to find the rightmost small cluster. Our method sparsifies the graph at low-density regions, allowing SC to cut along the valley and detect the small cluster, and is robust to outliers.

We consider a multi-cluster, complex-shaped dataset composed of one small Gaussian cluster and two proximal crescent-shaped clusters, shown in Fig. 4. We have a sample size of with the rightmost small cluster formed by of the samples and two crescents each. This example is for illustrative purposes only, with a single run, so we did not parameterize the graph or apply the optimization step (3) in the framework. We fix and choose , , where is the average -NN distance. Model-based approaches can fail on such a dataset due to the complex shapes of the clusters. The 3-partition SC based on RCut is applied. We observe in Fig. 4 that SC fails on the -NN and -matching graphs for two reasons: (1) SC cuts at balanced positions and cannot detect the rightmost small cluster; (2) SC cannot recognize the long winding low-density regions between the two crescents because there are too many spurious edges and the cut value along the curve is large. SC fails on the -graph (similar to full-RBF) because the outlier point forms a singleton cluster and SC also cannot recognize the low-density curve. Our RMD graph significantly sparsifies the graph at low densities, enabling SC to cut along the valley, detect small clusters and reject outliers.

V-B Real experiments with similarity networks

We focus on imbalanced settings for several real datasets. We construct -NN, -matching, full-RBF and RMD graphs all combined with RBF weights, but do not include the -graph because of its overall poor performance [21]. Our sample size varies from 750 to 1500. We discretize not only but also and to parameterize graphs. We vary in . While small may lead to disconnected graphs this is not an issue for us since singleton cluster candidates are ruled infeasible in PCut. Also notice that for , RMD graph is identical to the -NN graph. For RBF parameter it has been suggested to use a value on the same scale as the average -NN distance [6]. This suggests a discretization of with . We discretize with steps of .

In the model selection step Eq. (8), cut values of various partitions are evaluated on the same -NN graph with , before selecting the min-cut partition. The true number of clusters/classes is assumed known. We assume meaningful clusters are at least of the total number of points, i.e. . We set the GTAM parameter as in [21] for the SSL tasks and each time 20 randomly labeled samples are chosen with at least one sample from each class.
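The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-based edge representation, and the toy graph are all assumptions made here, and the candidate partitions are written by hand rather than produced by SC, GRF or GTAM.

```python
from collections import Counter

def cut_value(weights, labels):
    """Sum of edge weights whose endpoints lie in different clusters.

    `weights` maps an undirected edge (u, v) with u < v to its weight.
    """
    return sum(w for (u, v), w in weights.items() if labels[u] != labels[v])

def select_partition(candidates, weights, num_clusters, min_size):
    """Return (cut, labels) for the feasible candidate with the smallest cut, or None."""
    best = None
    for labels in candidates:
        sizes = Counter(labels)
        # Feasibility: the expected number of clusters, each with at least min_size points.
        if len(sizes) != num_clusters or min(sizes.values()) < min_size:
            continue
        value = cut_value(weights, labels)
        if best is None or value < best[0]:
            best = (value, labels)
    return best

# Hypothetical reference graph: two triangles {0,1,2} and {3,4,5} joined by one weak edge.
weights = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
           (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
           (2, 3): 0.1}
candidates = [
    [0, 0, 0, 1, 1, 1],  # cuts only the weak edge
    [0, 0, 1, 1, 1, 1],  # cuts two strong edges
    [0, 1, 1, 1, 1, 1],  # contains a singleton cluster, infeasible for min_size=2
]
best_cut, best_labels = select_partition(candidates, weights, num_clusters=2, min_size=2)
```

All candidates are scored on one fixed reference graph so that their cut values are comparable, which mirrors the use of a single -NN graph in the text.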

(a) SC (clustering) (b) GRF (SSL) (c) GTAM (SSL)
Fig. 5: Error rates of SC and SSL algorithms on USPS 8vs9 with varying levels of imbalancedness. Our RMD scheme remains competitive when the data is balanced and adapts to imbalancedness much better than traditional graph constructions.

Varying imbalancedness:
We use the digits 8 and 9 in the 256-dim USPS digit dataset and randomly sample 750 points with different levels of imbalancedness. Normalized SC, GRF and GTAM are then applied. Fig. 5 shows that when the underlying clusters/classes are balanced, our RMD method performs as well as traditional graphs. As the imbalancedness increases the performance of other graphs degrades, while our method can adapt to different levels of imbalancedness.
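Disproportionate sampling of this kind can be sketched in a few lines. The helper name `sample_imbalanced`, the seed, and the placeholder label vector below are assumptions for illustration; the 150/600 split is the 2-cluster setting listed in Table I.

```python
import random

def sample_imbalanced(labels, per_class, seed=0):
    """Return sorted indices with per_class[c] points drawn without replacement from each class c."""
    rng = random.Random(seed)
    chosen = []
    for cls, count in per_class.items():
        members = [i for i, y in enumerate(labels) if y == cls]
        if count > len(members):
            raise ValueError(f"class {cls!r} has only {len(members)} points")
        chosen.extend(rng.sample(members, count))
    return sorted(chosen)

# Hypothetical label vector: 1000 points for each of the digits 8 and 9.
labels = [8] * 1000 + [9] * 1000
idx = sample_imbalanced(labels, {8: 150, 9: 600})
```

Changing the per-class counts while keeping the total at 750 gives the different imbalance levels of the kind swept in Fig. 5.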

Datasets                         #samples per cluster
2-cluster (e.g. USPS 8/9)        150/600
3-cluster (e.g. SatImg 3/4/5)    200/400/600
4-cluster (e.g. USPS 1/8/3/9)    200/300/400/500
TABLE I: Imbalancedness of datasets.

Error Rates (%)   USPS              SatImg                     OptDigit                   LetterRec
                  8vs9     1,8,3,9  4vs3     3,4,5    1,4,7    9vs8     6vs8     1,4,8,9  6vs7     6,7,8
RBF -NN (BO)      33.20    17.60    15.76    22.08    25.28    15.17    11.15    30.02    7.85     38.70
RBF -NN           16.67    13.21    12.80    18.94    25.33    9.67     10.76    26.76    4.89     37.72
RBF -match        17.33    12.75    12.73    18.86    25.67    10.11    11.44    28.53    5.13     38.33
full-RBF          19.87    16.56    18.59    21.33    34.69    11.61    15.47    36.22    7.45     35.98
full-aRBF         18.35    16.26    16.79    20.15    35.91    10.88    13.27    33.86    7.58     35.27
RBF RMD           4.80     9.66     9.25     16.26    20.52    6.35     6.93     23.35    3.60     28.68
RBF RMD (O)       3.13     7.89     8.30     14.19    18.72    5.43     6.27     19.71    3.02     25.33
TABLE II: Error rates of normalized SC on various graphs for imbalanced real datasets. Our method performs significantly better than other methods. First row (“BO” Balanced Oracle) shows RBF -NN results on imbalanced data with tuned using ground truth labels but on balanced data. Last row (“O” Oracle) shows the best oracle results of RBF RMD on imbalanced data.

Other real datasets:
We apply SC and SSL algorithms on several other real datasets including USPS (256-dim.), Statlog landsat satellite images (4-dim.), letter recognition images (16-dim.) and optical recognition of handwritten digits (16-dim.) [46]. We sample the datasets in an imbalanced manner as shown in Table I.

In Table II, the first row shows results on imbalanced data for RBF -NN with oracle parameters tuned using ground-truth labels on balanced data for each dataset (300/300, 250/250/250 and 250/250/250/250 samples for the 2-, 3- and 4-class cases). Comparison of the first two rows reveals that the oracle choice on balanced data may not be suitable for imbalanced data, while our PCut framework, although agnostic, picks more suitable for RBF -NN. The last row presents oracle results on RBF RMD tuned to imbalanced data. This shows that our PCut on RMD, agnostic of true labels, closely approximates the oracle performance. In addition both tables show that our RMD graph parameterization performs consistently better than other methods. Similarly, Table III shows the performance of different graph constructions for SSL tasks with the GRF and GTAM algorithms. We again observe that the RMD graph construction performs significantly better than all other constructions in the SSL tasks on all datasets.

Error Rates (%)       USPS              SatImg            OptDigit                   LetterRec
                      8vs6     1,8,3,9  4vs3     1,4,7    6vs8     8vs9     6,1,8    6vs7     6,7,8
GRF   RBF -NN         5.70     13.29    14.64    16.68    5.68     7.57     7.53     7.67     28.33
      RBF -matching   6.02     13.06    13.89    16.22    5.95     7.85     7.92     7.82     29.21
      full-RBF        15.41    12.37    14.22    17.58    5.62     9.28     7.74     11.52    28.91
      full-aRBF       12.89    11.74    13.58    17.86    5.78     8.66     7.88     10.10    28.36
      RBF RMD         1.08     10.24    9.74     15.04    2.07     2.30     5.82     5.23     27.24
GTAM  RBF -NN         4.11     10.88    26.63    20.68    11.76    5.74     12.68    19.45    27.66
      RBF -matching   3.96     10.83    27.03    20.83    12.48    5.65     12.28    18.85    28.01
      full-RBF        16.98    11.28    18.82    21.16    13.59    7.73     13.09    18.66    30.28
      full-aRBF       13.66    10.05    17.63    22.69    12.15    7.44     13.09    17.85    31.71
      RBF RMD         1.22     9.13     18.68    19.24    5.81     3.12     10.73    15.67    25.19
TABLE III: Error rate performance of GRF and GTAM for imbalanced real datasets. Our method performs significantly better than other methods.

V-C Community detection with connectivity networks

In this section we consider the adaptation of RMD to connectivity networks as described in Sec. III-B. We first consider performance on synthetic networks using the stochastic block model for graph generation with two clusters of sizes and nodes with imbalance coefficient . The two clusters each follow an Erdős-Rényi model with edge probabilities and respectively and inter-cluster edge probabilities .
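A two-block stochastic block model of this form can be sampled as in the sketch below. The function name `two_block_sbm`, the argument names, and the numeric values in the usage line are illustrative assumptions, not the settings used in the experiments.

```python
import random

def two_block_sbm(n1, n2, p1, p2, q, seed=0):
    """Sample an undirected two-block SBM and return (edges, labels).

    Nodes 0..n1-1 form cluster 0 and the remaining n2 nodes form cluster 1.
    Pairs inside cluster 0 (resp. 1) are connected with probability p1 (resp. p2),
    and pairs across clusters with probability q.
    """
    rng = random.Random(seed)
    labels = [0] * n1 + [1] * n2
    within = (p1, p2)
    edges = set()
    n = n1 + n2
    for u in range(n):
        for v in range(u + 1, n):
            prob = within[labels[u]] if labels[u] == labels[v] else q
            if rng.random() < prob:
                edges.add((u, v))
    return edges, labels

# Hypothetical imbalanced instance: a dense small cluster and a sparser large one.
edges, labels = two_block_sbm(n1=10, n2=40, p1=0.8, p2=0.3, q=0.05)
```

Each edge is stored once as a pair (u, v) with u < v, so the graph is unweighted and undirected, matching the connectivity network setting.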

We first consider an illustrative example of how affects the cut value and clustering error with fixed imbalance , , and in Fig. 6 over 20 generated graphs. is chosen such that the expected degree of each node is equal, in order to prevent clustering using node degrees. We first observe that the cut value is a good indicator of clustering performance for different , i.e. the parameter value chosen by Eq. (8) also minimizes the clustering error (all shown cuts satisfied the size constraint with ). We also observe that the parameter that minimizes cut value decreases the clustering error from about in the baseline case (which performs SC on the given graph and corresponds to ) to about , representing an decrease.

Fig. 6: Average clustering error and normalized cut values with standard deviation error bars for different parameterizations , computed over 20 simulated graphs. The parameter that minimizes cut value also minimizes clustering error and provides an decrease in clustering error on average, compared to the baseline given by SC on the original graph ().

Using the same graph generation model, we next investigate the effect of RMD on the performance of community detection for different imbalance coefficients and graph parameters in Fig. 7. We again consider nodes with imbalance varying between and . We set and scaling proportionally with which we normalize such that it is equal to when . We choose such that the expected node degrees are uniform as in the previous example. To obtain the clustering with RMD we optimize over the interval with increments and choose the parameter that minimizes the cut value, as in Eq. (8). We observe that RMD does not provide significant performance improvements for balanced cluster sizes, however it performs significantly better compared to SC on the baseline graph for imbalanced cluster sizes as expected. We also remark that in Fig. 7 the reason the error reduction factor exceeds 1 at times is the mismatch between the parameters that minimize the cut value and the clustering error (which should ideally be less than or equal to SC error, since is in the parameter set).

Fig. 7: Average clustering error for SC and RMD with standard deviation error bars for different imbalance parameters on the left figure. Right figure shows the average and standard deviation error bars for RMD error to SC error ratio. Both statistics computed over 20 simulated graphs for each . While RMD does not provide much gain in performance when , for smaller imbalance factors we observe an up to reduction in clustering error.

Performance on real world examples:
In this section we consider two social network datasets with well-established community structures. The first dataset we consider is the network from Zachary’s karate club study [47] which is widely used to evaluate community detection methods [48]. Zachary observed 34 members of a karate club over two years where the group split into two separate clubs after a disagreement. These two clubs constitute the two ground truth communities of 16 and 18 nodes, around the instructor (node 1) and administrator (node 34) of the club respectively. The network with nodes corresponding to members and edges corresponding to binary friendship indicators as determined by Zachary is illustrated in the left figure of Fig. 8, where the coloring indicates the eventual membership of the two clubs.

Evaluating the baseline SC method and PCut using RMD on the network, we observe that only node 3 is attributed to the wrong community, resulting in good performance in both cases, which is to be expected on this relatively small and simple dataset. We then consider an under-observed and more imbalanced version of the network, illustrated in the right panel of Fig. 8, where we removed 8 outlying nodes in the blue community with node numbers 15, 16, 19, 21, 23, 24, 27 and 30. Evaluating SC on the reduced graph, we observe 10 misattributed nodes with node numbers 2, 3, 4, 8, 12, 13, 14, 18, 20 and 22, while RMD is successful in recovering all but node 3’s community attribution as in the full graph using minimum cluster size parameter .

Fig. 8: Visualization of Zachary’s karate club social network on the left figure. Right figure illustrates the under-observed network that we consider with less balanced community sizes.

The second dataset we consider is the network of 62 bottlenose dolphins living in Doubtful Sound, New Zealand analyzed by [49], which is another dataset widely used for benchmark purposes [48]. The edges in the network were determined by sightings of pairs of dolphins and the two ground truth communities of sizes 20 and 42 correspond to dolphins that separated after a dolphin left the area for some time. As the communities are internally well-connected with internal cliques and only six edges between the communities, SC is successful in recovering all but one node association. For a meaningful comparison, we again consider an under-observed version of the graph with nodes randomly removed from the small community. We consider 1 to 12 removed nodes with 100 different samplings each and illustrate the error rates for SC and RMD in Fig. 9. We observe that as the number of removed nodes increases, the graph becomes more imbalanced and the error rates of both methods increase. However, PCut with RMD performs consistently better on more imbalanced cluster sizes, with up to 40% decrease in average error rates.

Fig. 9: Average clustering error for bottlenose dolphins network with randomly removed number of nodes varying from 1 to 12. Disconnected nodes that are left after removing the initial nodes are also removed. Minimum community size parameter is selected as for PCut with RMD.

Finally, we evaluate our method on experiments performed on graphs generated by the LFR benchmark algorithm [50]. The benchmark algorithm accounts for the power-law behavior of both degree and community size distributions in real networks, resulting in more realistic network structures compared to the SBM that we considered before. We refer the reader to [50] for more details on the algorithm. While the previous two real networks we considered were small in size and had two communities, with these benchmarks we investigate the behavior of PCut with RMD in networks with a larger number of nodes and communities. For the experiments in this section we consider 200 nodes, degree distribution parameter , average degree , , mixing parameter , community size distribution parameter and maximum community size . We vary the minimum community size parameter between 10 and 50 in 10 logarithmic increments and sample 100 graphs each to get a good spread of imbalanced and balanced communities, obtaining 1000 graphs in total. We note that results in an exactly balanced network with 4 communities of 50 nodes each. We also remark that the number of communities is variable and random, and increases with decreased , with as many as 12 communities for small . We choose minimum partition size parameter for all graphs. To compute the error of a given partitioning, we use the Hungarian algorithm [51] that finds the optimal permutation of found community labels to ground truth labels by solving a minimum weighted bipartite matching problem, with assignment cost between two communities , set equal to .
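The label-matching step can be sketched as below. This is an illustrative implementation under stated assumptions: the assignment cost is taken to be the negative overlap between a found community and a ground truth community (the paper's exact cost is not reproduced here), and the names `hungarian` and `clustering_error` are chosen for this sketch. The solver is the standard O(k^3) potentials-based form of the Hungarian algorithm for a square cost matrix.

```python
def hungarian(cost):
    """Minimum-cost assignment for a square cost matrix; returns the column chosen for each row."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (n + 1)   # column potentials (1-indexed)
    p = [0] * (n + 1)     # p[j]: row currently matched to column j, 0 if none
    way = [0] * (n + 1)   # predecessor column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:        # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [-1] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match

def clustering_error(found, truth):
    """Fraction of nodes misassigned under the best one-to-one relabeling of found communities."""
    rows, cols = sorted(set(found)), sorted(set(truth))
    k = max(len(rows), len(cols))          # pad to a square matrix with empty communities
    r_idx = {c: i for i, c in enumerate(rows)}
    c_idx = {c: i for i, c in enumerate(cols)}
    overlap = [[0] * k for _ in range(k)]
    for f, t in zip(found, truth):
        overlap[r_idx[f]][c_idx[t]] += 1
    # Assumed cost: negative overlap, so a minimum-cost matching maximizes agreement.
    match = hungarian([[-x for x in row] for row in overlap])
    agree = sum(overlap[i][match[i]] for i in range(k))
    return 1.0 - agree / len(found)

err = clustering_error([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0])
```

In the small example, five of the six nodes agree after relabeling. A library routine such as `scipy.optimize.linear_sum_assignment` could replace the hand-written solver where SciPy is available.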

(a) SC (b) PCut with RMD (c) Absolute improvement
Fig. 10: (a) Scatter plot of clustering error for baseline SC, (b) for PCut with RMD for 1000 generated graphs. (c) Box plot of absolute improvement in error rate from SC to RMD, ignoring samples with zero SC and RMD error. The x-axis in all graphs is the minimum community size in a given graph, while y-axis is the clustering error. In all three plots, minimum community sizes are quantized to bins with a width of 5 such that the first group in the plots corresponds to minimum community sizes of 10-14, second group of 15-19 and last group of only 50. Jittering is also added to the points on the horizontal axis to better visualize overlapping values in scatter plots (a,b).

Using the minimum community size in a generated graph as an indicator for imbalancedness of community sizes, we plot clustering error vs. minimum community size for SC and PCut with RMD in Fig. 10. We observe a distinct bimodal behavior in the error rates for SC, where there is a significant number of samples with approximately zero error (the fraction of which increases with the minimum community size) and the rest of the samples follow a roughly linearly increasing error rate. This error behavior is due to the algorithm merging separate clusters and splitting a cluster into multiple clusters in certain cases, which is consistent with the error rates being near the cluster sizes for the balanced case (minimum size close to 50). We illustrate this with one of the generated balanced networks with SC error in Fig. 11. We see that SC has merged the top two clusters together while splitting the bottom-left cluster in two, causing the high error. However, the optimal PCut corresponding to RMD graph with has emphasized the separation of the four communities, leading to the correct assignment of nodes. While RMD graphs provide error improvement for imbalanced clusters as expected, we see that they can also resolve the cluster merging and splitting problems by emphasizing the separation between clusters, reducing the error rate to zero in this example. We can observe from Fig. 10 that PCut improves upon SC with a median absolute improvement in the error rate for imbalanced graphs, to up to median absolute improvement in balanced cases.

(a) (b) (c)
Fig. 11: Example network generated by LFR with exactly balanced communities. (a) Ground truth communities, (b) SC estimation, (c) PCut estimation on RMD graph with . RMD parameterization emphasizes the separation between communities and prevents spectral clustering from merging the two communities on the top.

VI Conclusion

In this paper we investigated the performance of spectral clustering based on minimizing RCut or NCut objectives for graph partitioning with imbalanced partition sizes and showed that these objectives may lead to poor clustering performance in both similarity and connectivity network modalities. To this end we proposed the partition constrained min-cut (PCut) framework, which seeks min-cut partitions under minimum cluster size constraints. Since constrained min-cut is NP-hard, we adopt existing spectral methods (SC, GRF, GTAM) as a black-box subroutine on a parameterized family of graphs to generate candidate partitions and solve PCut on these partitions. We proposed rank modulated degree (RMD) graphs as a rich graph parameterization based on adaptively modulating the node degrees in varying levels to adapt to different levels of imbalanced data. Our framework automatically selects the parameters based on PCut objective, and can be used in conjunction with other graph-based partition methods such as 1-spectral clustering, Cheeger cut or sparsest cut [24, 40, 41, 42]. Our idea is justified through limit cut analysis and both synthetic and real experiments on clustering and SSL tasks for the similarity networks, and community detection for connectivity networks.

Acknowledgments

This material is based upon work supported in part by the U.S. Department of Homeland Security, Science and Technology Directorate, Office of University Programs, under Grant Award 2013-ST-061-ED0001, the National Science Foundation under Grants CCF-1320547 and 1218992, and by ONR Grant 50202168. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security, NSF and ONR.

VII Appendix: Proofs of Theorems

For convenience in analysis, let and divide data points into sets , where and each , contains points. is used to generate the statistic for and for . is used to compute the rank of with the formula

(14)

We analyze and prove our result for the statistic of the form

(15)

where we used in place of and denotes the distance from to its -th nearest neighbor among points in . In practice we can omit the weight and use the average of first to -th nearest neighbor distances as described in Sec. III.

Proof of Theorem IV.1:

The proof involves two steps:

  • The expectation of the empirical rank is shown to converge to as .

  • The empirical rank is shown to concentrate at its expectation as .

The first step is shown through Lemma VII.2. For the second step, notice that the rank , where is independent across different ’s, and . By Hoeffding’s inequality, we have

(16)

Combining these two steps completes the proof.
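For reference, the generic two-sided form of Hoeffding's inequality used in the concentration step is stated below. The symbols $Y_i$, $n$ and $t$ are chosen here for illustration and are not the notation of Eq. (16).

```latex
% Hoeffding's inequality for independent, bounded random variables.
% Assumption: Y_1, ..., Y_n are independent with Y_i \in [0, 1].
\[
  \Pr\!\left( \left| \frac{1}{n}\sum_{i=1}^{n} Y_i
      - \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Y_i \right] \right| \ge t \right)
  \;\le\; 2\exp\!\left( -2 n t^{2} \right),
  \qquad t > 0 .
\]
```

Since the rank is an average of independent indicator variables, each term lies in $[0,1]$ and the bound applies directly.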

Proof of Theorem IV.2:

We want to establish convergence results for the cut term and the balancing terms respectively, that is:

(17)
(18)
(19)

where are the discrete versions of .

The balancing terms Eq. (18), (19) are obtained similarly using Chernoff bound on the sum of binomial random variables, since the number of points in is binomially distributed . Details can be found in [20].

Eq. (17) is established in two steps. First we can show that the LHS cut term converges to its expectation by McDiarmid’s inequality. Second we show this expectation term actually converges to the RHS of Eq. (17). This is shown in Lemma VII.1.
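For reference, the standard bounded-differences (McDiarmid) inequality invoked in the first step reads as follows. The function $f$, the constants $c_i$ and the threshold $t$ are generic symbols introduced here and are not the paper's notation.

```latex
% McDiarmid's inequality.
% Assumption: X_1, ..., X_n are independent, and changing the i-th argument
% of f alone changes its value by at most c_i.
\[
  \sup_{x_1,\dots,x_n,\,x_i'}
  \bigl| f(x_1,\dots,x_i,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n) \bigr| \le c_i
  \quad \text{for } i = 1,\dots,n
\]
\[
  \Longrightarrow\quad
  \Pr\!\bigl( \left| f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \right| \ge t \bigr)
  \;\le\; 2\exp\!\left( -\frac{2 t^{2}}{\sum_{i=1}^{n} c_i^{2}} \right),
  \qquad t > 0 .
\]
```

Applying it to the cut term requires bounding how much the cut can change when a single sample point is replaced.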

Lemma VII.1

Given the assumptions of Theorem IV.2,

where .

Proof:

The proof is an extension of [52]. For simplicity of exposition we provide an outline of the extension and further details can be found in [20]. The first trick is to define a cut function for a fixed point whose expectation is easier to compute,

(20)

is defined similarly for . The expectation of and can be shown to satisfy

(21)

Then the value of can be computed as

(22)

where is the distance of to its -th nearest neighbor. The value of is a random variable and can be characterized by the CDF . Combining the above with Eq. (21) we can write down the whole expected cut value

(23)

where to simplify the expression we used to denote

(24)

Under general assumptions, the random variable will highly concentrate around its mean when tends to infinity. Furthermore, as , tends to zero and the speed of convergence is given by

(25)

Therefore the inner integral in the cut value can be approximated by , which implies that

(26)

The next trick is to decompose the integral over into two orthogonal directions, i.e. the direction along the hyperplane and its normal direction

where we used to denote the unit normal vector. When , the integral region of will be empty, i.e. . On the other hand, when is close to , we have the approximation

The term is the volume of the -dimensional spherical cap of radius , which is at distance to the center. Through direct computation we obtain

Combining the above step and plugging in the approximation of in Eq. (25), we finish the proof.

Lemma VII.2

By choosing properly, it follows that as ,

Proof:

Taking the expectation with respect to we have

The last equality holds due to the i.i.d. symmetry of and . We fix both and and temporarily disregard . Let , where are the points in . It follows that:

To check McDiarmid’s requirements, we replace with . It is easily verified that ,