An Exact No Free Lunch Theorem for Community Detection

March 25, 2019 ∙ Arya D. McCarthy et al.

A precondition for a No Free Lunch theorem is evaluation with a loss function which does not assume a priori superiority of some outputs over others. A previous result for community detection by Peel et al. (2017) relies on a mismatch between the loss function and the problem domain. The loss function computes an expectation over only a subset of the universe of possible outputs; thus, it is only asymptotically appropriate with respect to the problem size. By using the correct random model for the problem domain, we provide a stronger, exact No Free Lunch theorem for community detection. The claim generalizes to other set-partitioning tasks including core/periphery separation, k-clustering, and graph partitioning. Finally, we review the literature of proposed evaluation functions and identify functions which (perhaps with slight modifications) are compatible with an exact No Free Lunch theorem.


1 Introduction

A myriad of tasks in machine learning and network science involve discovering structure in data. Especially as we process graphs with millions of nodes, analysis of individual nodes is untenable, while global properties of the graph ignore local details. It becomes critical to find an intermediate level of complexity, whether it be communities, cores and peripheries, or other structures. Points in metric space and nodes of graphs can be clustered, and hubs identified, using algorithms from network science. A longstanding theoretical question in machine learning has been whether an ultimate clustering algorithm is a possibility or merely a fool’s errand.

Largely, the question was addressed by Wolpert (1996) as a No Free Lunch theorem, a claim about the limitations of algorithms with respect to their problem domain. When an appropriate function is chosen to quantify the error (or loss), no algorithm can be superior to any other: an improvement across one subset of the problem domain is balanced by diminished performance on another subset. This is jarring at first. Are we not striving to find the best algorithms for our tasks? Yes—but by making specific assumptions about the subset of problems we expect to encounter, we can be comfortable tailoring our algorithms to those problems and sacrificing performance on remote cases.

Figure 1: k-means clustering when certain assumptions are violated. (a) Non-spherical clusters. (b) Unequal variances.

As an example, the k-means algorithm for k-clustering is widely used for its simplicity and strength, but it assumes spherical clusters, equal variance in those clusters, and similar cluster sizes (equivalent to a homoscedastic Gaussian prior). Figure 1 shows the degraded performance on problems where these assumptions are violated.
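As a minimal illustration of these failure modes (the data and parameters below are illustrative choices, assuming scikit-learn and NumPy are available; this is a sketch, not the figure's original setup):

    # Minimal illustration of Figure 1's failure modes; cluster geometry and
    # parameters are illustrative choices, not taken from the paper.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_mutual_info_score

    # (a) Non-spherical clusters: stretch isotropic blobs with a linear map.
    X, y = make_blobs(n_samples=600, centers=3, random_state=0)
    X_aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])
    pred_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_aniso)

    # (b) Unequal variances: one tight cluster and two diffuse ones.
    X_var, y_var = make_blobs(n_samples=600, centers=3,
                              cluster_std=[0.3, 2.5, 2.5], random_state=0)
    pred_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_var)

    # k-means' agreement with the planted labels typically degrades in both cases.
    print("AMI, non-spherical clusters:", adjusted_mutual_info_score(y, pred_a))
    print("AMI, unequal variances     :", adjusted_mutual_info_score(y_var, pred_b))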

To prove a No Free Lunch theorem for a particular task demands an appropriate loss function. A No Free Lunch theorem was argued for community detection (Peel et al., 2017), using the adjusted mutual information function (Vinh et al., 2009). (Throughout this work, we assume that we evaluate against a known ground truth, as opposed to some intrinsic measure of partition properties like modularity (Newman and Girvan, 2004).) However, the theorem is inexact. A No Free Lunch theorem relies on a loss function which imparts generalizer-independence (formally defined below): one which does not assume a priori that some prediction is superior to another. The loss function used in the proof is only asymptotically independent in the size of the input. We present a correction: by substituting an appropriate loss function, we are able to claim an exact version of the No Free Lunch theorem for community detection. The result generalizes to other set-partitioning tasks when evaluated with this loss function, including clustering, k-clustering, and graph partitioning.

2 Background

2.1 Community detection

A number of tasks on graphs seek a partition of the graph’s nodes that maximizes a score function. Situated between the microscopic node-level and the macroscopic graph-level, these partitions form a mesoscopic structure—be it a core–periphery separation, a graph coloring, or our focus: community detection (CD). Community detection has been historically ill-defined (Radicchi et al., 2004; Yang et al., 2016), though the intuition is to collect nodes with high interconnectivity (or edge density) into communities with low edge density between them. The task is analogous to clustering, in which points near one another in a metric space are grouped together.

To assess whether the formulation of community detection matches one’s needs, one performs extrinsic evaluation against a known ground truth clustering. This ground truth can come from domain knowledge of real-world graphs or can be planted into a graph as a synthetic benchmark. After running community detection on the graph, some similarity or error measure between the computed community structure and the correct one can be computed.

No bijection between true structure and graph

Unfortunately, ground truth communities do not imply a single graph—and vice versa. Peel et al. (2017) go as far as to claim, "Searching for the ground truth partition without knowing the exact generative mechanism is an impossible task."

We can imagine the following steps for how problem instances are created, given that we have n nodes:

  1. Sample a (true) partition T from the universe 𝒫;

  2. Generate a graph G from T by adding edges according to the edge-generating process g,

where 𝒫 is our universe: the space of all partitions of n objects. Given a graph G, we can imagine multiple truths T that could define its edge set by different generative processes g: 𝒫 → 𝒢, where 𝒢 is the set of all graphs with n nodes. Peel et al. (2017) give a proof that extends from this simple example: Imagine that T₁ partitions the nodes into n components (the n-partition), and T₂ partitions them into one component (the 1-partition). Let g₁ exactly specify the number of edges between each pair of communities, such that G = g₁(T₁) with probability 1. Similarly, let g₂ be an Erdős–Rényi model such that G = g₂(T₂) with nonzero probability. (Peel et al. (2017) note that this is easily extended to graphs with more nodes.) We thus have two different ways to create a single graph; how can a method discern the correct one, without knowledge of g?
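To make the two-step process concrete, the following toy sketch samples a partition and generates a graph from it. The planted-partition probabilities p_in and p_out stand in for an edge-generating process g; they are illustrative choices, not a definition taken from Peel et al. (2017):

    # Toy sketch of the two-step process for creating a problem instance:
    # (1) sample a true partition T of n nodes, (2) generate a graph G from T
    # with an edge-generating process g.  The planted-partition probabilities
    # p_in and p_out are illustrative assumptions, not the paper's g.
    import random
    from itertools import combinations

    def sample_partition(n, k, rng):
        """Assign each of n nodes to one of k groups uniformly at random."""
        return [rng.randrange(k) for _ in range(n)]

    def generate_graph(partition, p_in=0.7, p_out=0.05, rng=random):
        """Planted-partition edge process: dense within groups, sparse between."""
        edges = set()
        for u, v in combinations(range(len(partition)), 2):
            p = p_in if partition[u] == partition[v] else p_out
            if rng.random() < p:
                edges.add((u, v))
        return edges

    rng = random.Random(0)
    T = sample_partition(n=12, k=3, rng=rng)
    G = generate_graph(T, rng=rng)

    # The same edge set G could also have been produced, with nonzero
    # probability, by an Erdos-Renyi process that ignores T entirely: the map
    # from (partition, process) to graph is not invertible.
    print("true partition:", T)
    print("number of edges:", len(G))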

Community detection is then an ill-posed inverse problem: Use a function 𝒜: 𝒢 → 𝒫 to produce a clustering C, which is hopefully representative of T (Peel et al., 2017, Appendix C). (That is, the objective is to recover a partition in the preimage g⁻¹(G).) The function g is not a bijection, so there isn't a unique T represented in the given graph. Our algorithm 𝒜 must encode our prior beliefs about the generative process g to select from among candidates. For this reason, we must hope that the benchmark graphs that we use are representative of the generative process for graphs in our real-world applications. That is, we hope that our benchmark domain matches our practical domain.

Other set-partitioning tasks

While the remainder of this work focuses on community detection, our claims are relevant to other set-partitioning tasks. Notable examples are clustering (the vector space analogue to community detection), graph k-partitioning, and k-clustering. Metadata about the nodes and edges, such as vector coordinates, are used to guide the identification of such structure, but the tasks are all fundamentally set-partitioning problems. They can also have different universes—the latter tasks have a smaller universe than does community detection, for a given graph G: they consider only partitions with a fixed number of clusters.

2.2 No Free Lunch theorems

The No Free Lunch theorem in machine learning is a claim about the universal (in)effectiveness of learning algorithms. Every algorithm performs equally well when averaging over all possible input–output pairs. Formally, for any learning method 𝒜, the error (or loss) ℒ of the method, summed over all possible problems, equals a loss-specific constant Λ_ℒ:

\sum_{T \in \mathcal{P}} \mathcal{L}\bigl(\mathcal{A}(g(T)),\, T\bigr) \;=\; \Lambda_{\mathcal{L}} \qquad (1)

defining the edge-generating process g and partition T as above. A loss satisfying this equality is generalizer-independent. To reduce loss on a particular set of problems means sacrificing performance on others—there is no free lunch (Wolpert, 1996; Schumacher et al., 2001). Judiciously choosing which set to improve involves making assumptions about the distribution of the data: as we've mentioned, k-means is a method for k-clustering which works well on data with spherical covariance, similar cluster variances, and roughly equal class sizes. When these assumptions are violated, performance suffers, and the overall balance is preserved.
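The flavor of Equation 1 can be checked by brute force with the simplest generalizer-independent loss, zero-one loss over a small finite universe. This toy check is our own construction, not Wolpert's:

    # Zero-one loss over a small finite universe of possible answers: the loss
    # summed over every possible ground truth is the same constant for any
    # fixed guess, so no guesser is better than another on average.
    universe = list(range(10))  # stand-in for the space of partitions

    def summed_loss(guess):
        return sum(1 for truth in universe if truth != guess)

    # Every deterministic guesser accumulates |universe| - 1 = 9 total loss.
    print({guess: summed_loss(guess) for guess in universe})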

2.3 Community detection as supervised learning

We follow Peel et al. (2017) in framing the task of community detection (CD) as a learning problem. While recent algorithms, e.g., Chen et al. (2019), have introduced learnable parameters to community detection, the CD literature's algorithms are by and large untrained. These untrained algorithms encode knowledge of the problem domain in prior beliefs. We note that our work and Peel et al. (2017) straightforwardly handle both of these cases.

In general supervised machine learning problems, we seek to learn the function that maps an input space 𝒳 to an output space 𝒴. We consider problem instances as sampled from random variables over each, so our goal is to learn the conditional distribution P(y | x). In the process of training on a dataset D, we develop a distribution over hypotheses h, which are estimates of the distribution P(y | x).

In the case of most community detection algorithms, our input space is the set 𝒢 of graphs on n nodes, and the output space is the universe 𝒫 of partitions of those nodes. There is no training data: D = ∅. All of our prior beliefs about P(y | x) must be encoded in the prior distribution over hypotheses. That is, the model itself must contain our beliefs about the definition of community structure. Only from this encoded prior and an observed input (our graph G) do we form our point estimate of the true distribution (Peel et al., 2017). However, in the case of trainable CD algorithms, we encode our beliefs in the posterior distribution over hypotheses given the training data.

2.4 Loss functions and a priori superiority

How should we evaluate an algorithm’s predictions? Classification accuracy won’t cut it: When comparing to the ground truth, there are no specific labels (e.g. no notion of a specific Cluster 2)—only unlabeled groups of like entities. We settle for a measure of similarity in the groupings, quantifying how much the computed partition tells us about the ground truth.

A popular choice of measure is the normalized mutual information (NMI; Kvalseth, 1987) between the prediction and the ground truth. While this measure has a long history in community detection, its flaws have been well noted (Vinh et al., 2009; Peel et al., 2017; McCarthy and Matula, 2018; McCarthy et al., 2019). It imposes a geometric structure upon the universe 𝒫. (To take the example of Peel et al. (2017), a squared Euclidean distance loss imposes a geometric structure: in the task of guessing points in the unit circle, guessing the center garners a higher reward, on average, than any other point.) As a result, something as simple as guessing the trivial all-singletons clustering outperforms methods that try at all to find a mesoscopic-level structure (McCarthy et al., 2019). The property which NMI lacks is generalizer-independence.
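A small computation makes the contrast concrete (the four-community ground truth is an illustrative choice; assumes scikit-learn):

    # NMI rewards the structure-free all-singletons guess, while AMI, which
    # subtracts the expected mutual information, scores it at (essentially) zero.
    import numpy as np
    from sklearn.metrics import (adjusted_mutual_info_score,
                                 normalized_mutual_info_score)

    n = 100
    truth = np.repeat(np.arange(4), n // 4)   # four equal communities of 25 nodes
    singletons = np.arange(n)                 # every node in its own cluster

    print("NMI(singletons, truth):", normalized_mutual_info_score(truth, singletons))
    print("AMI(singletons, truth):", adjusted_mutual_info_score(truth, singletons))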

The property of generalizer-independence is defined by the generalization error function, an expectation of the loss ℒ. To satisfy this property, the generalization error must be independent of the particular true value T. This is best expressed by Equation 1.

The adjusted mutual information (AMI, defined in section 4) (Vinh et al., 2009) is a proposed replacement for NMI which does not impose a geometric structure upon the space. Unfortunately, this benefit is not fully realized when the expectation is computed over a space other than the full universe 𝒫. For the random model used in Peel et al. (2017), the expected AMI across all problems is only asymptotically generalizer-independent as the graph size n grows—it is within some diminishing amount of error of generalizer-independence, as proven by Peel et al. (2017).

3 Previous Result: Approximate No Free Lunch Theorem

Peel et al. (2017) frame community detection in the style of learning algorithms, letting them prove a No Free Lunch theorem for community detection. They note that the claim holds for an appropriate choice of loss—specifically a loss function ℒ that is generalizer-independent—but their chosen loss function is not fully generalizer-independent. They also consider a stricter property than generalizer-independence: homogeneity. With a homogeneous loss function, the distribution of the error (not just its expectation) is identical, regardless of the ground truth. A measure which deviates from homogeneity may have this deviation bounded by a function of the number of vertices n (the graph order).

Lemma 1 (Peel et al., 2017).

Adjusted mutual information (AMI) is a homogeneous loss function over the interior of the space of partitions of n objects, i.e., excluding the 1-partition and the n-partition. Including these, AMI is homogeneous only to within a deviation on the order of 1/B_n, where B_n is the n-th Bell number, i.e., the number of partitions of a set of n nodes.

Wolpert (1996) gives a generalized No Free Lunch theorem, which assumes a homogeneous loss.

Theorem 1 (Wolpert, 1996).

For a homogeneous loss ℒ, the uniform average over all problem distributions of the generalization error equals a loss-specific constant Λ_ℒ. (Plainly, there is no free lunch.)

Peel et al. (2017) then use Wolpert’s result with their inexactly homogeneous measure to claim a No Free Lunch result.

Theorem 2 (Peel et al., 2017).

By Lemma 1 and Theorem 1, for the community detection problem with a loss function of AMI, the uniform average over all problem distributions of the generalization error equals a loss-specific constant.

But this choice of measure (AMI) is not, in fact, homogeneous over the entire universe (Lemma 1). A strategy that guesses either of the non-interior (i.e., boundary) partitions—the 1-partition or the n-partition—will yield a higher-than-average reward. There is indeed a negligible amount of free lunch—a free morsel, if you will.

4 Diagnosis: Random Models

Figure 2: M_perm and M_all when clustering three nodes, for two different ground truths (circled), each with a different cluster size pattern (panels (a) and (b)). The top and bottom clusterings, the 1-clustering and the n-clustering into singletons, are the boundary partitions. All other partitions form the interior. M_perm changes based on the ground truth, but M_all stays the same.

Peel et al. (2017) use AMI out of the box, as proposed by Vinh et al. (2009), which involves subtracting an expected value from a raw score. Unfortunately, AMI as given takes its expectation over the wrong distribution. Because of the mismatch, Peel et al. (2017)'s claim of homogeneity is accurate only to within a deviation on the order of 1/B_n when considering the trivial partitions into either one community or n communities.

Correcting this is arguably a pedantic demand, for two reasons:

  1. The deviation, a fraction on the order of 1/B_n, converges to 0 superexponentially as n increases (see the short computation after this list).

  2. The deficiency is only present when the ground truth T is one of the trivial partitions. Otherwise, AMI as used is exactly homogeneous. But the trivial partitions reflect a lack of any mesoscopic community structure.
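A short computation shows how quickly a deviation of order 1/B_n vanishes (using the Bell-triangle recurrence):

    # Bell numbers via the Bell triangle; a deviation of order 1/B_n vanishes
    # superexponentially in the number of nodes n.
    def bell(n):
        """n-th Bell number via the Bell triangle."""
        row = [1]
        for _ in range(n - 1):
            nxt = [row[-1]]
            for value in row:
                nxt.append(nxt[-1] + value)
            row = nxt
        return row[-1]

    for n in (5, 10, 15, 20):
        print(n, bell(n), 1.0 / bell(n))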

Nevertheless, we’d like to see a tight claim of generalizer independence. To do this, we must select the proper random model, a sample space for a distribution.

AMI adjusts NMI by subtracting the expected value of the mutual information from both the numerator and the denominator:

\operatorname{AMI}(C, T) \;=\; \frac{\operatorname{MI}(C, T) \;-\; \mathbb{E}_{M_{\mathrm{perm}}}\!\left[\operatorname{MI}(C, T)\right]}{\max\!\left(H(C), H(T)\right) \;-\; \mathbb{E}_{M_{\mathrm{perm}}}\!\left[\operatorname{MI}(C, T)\right]} \qquad (2)

where MI is the mutual information, maximized when the specific clustering C equals the ground truth T. By inspecting Equation 2, we see that AMI's value is 1 (the maximum) when C = T, 0 in expectation, and negative when the agreement between C and T is worse than chance.

Subtly hidden in this equation is the decision of which distribution to compute the expectation over. For decades, this distribution has been what Gates and Ahn (2017) call M_perm: all partitions of the same partition shape as C or T. (A partition shape is a multiset of cluster sizes, also called a decomposition pattern (Hauer and Kondrak, 2016) or a group-size distribution (Lai and Nardini, 2016); it is equivalent to an integer partition of n.) For example, if one of the two partitions divided 7 nodes into clusters of sizes 2, 2, and 3, then we would compute the expected mutual information over all clusterings where one had cluster sizes of 2, 2, and 3.

McCarthy et al. (2019) argue that M_perm is inappropriate. To use this random model assumes that we can only produce outputs within that restricted space, when in actuality the space of possible outputs is 𝒫, the set of all partitions of n nodes. Furthermore, during evaluation, we hold our ground truth T fixed, rather than marginalizing over possible ground truths. Were we to instead consider a distribution over ground truths, we would add noise from other possible generative processes which yield the same graph from different underlying partitions. In our average, we might be including scores on ground truths that better align with our notions of, say, core–periphery partitioning. For this reason, we take a one-sided expectation—over candidate clusterings C, holding T fixed. The one-sided uniform distribution over all partitions of n nodes is called M_all (Gates and Ahn, 2017). This distribution is what we use for our AMI expectation, giving a measure denoted AMI_all, which is recommended by McCarthy et al. (2019). It takes the form

\operatorname{AMI}_{\mathrm{all}}(C, T) \;=\; \frac{\operatorname{MI}(C, T) \;-\; \mathbb{E}_{C' \sim M_{\mathrm{all}}}\!\left[\operatorname{MI}(C', T)\right]}{\max_{C'' \in \mathcal{P}} \operatorname{MI}(C'', T) \;-\; \mathbb{E}_{C' \sim M_{\mathrm{all}}}\!\left[\operatorname{MI}(C', T)\right]} \qquad (3)

The differences between M_perm and M_all are illustrated in Figure 2. We will now show that substituting M_all for M_perm, hence using AMI_all, allows for an exact No Free Lunch theorem.
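The difference is easy to see by enumeration for a small node set. The sketch below fixes an illustrative ground truth on five nodes and compares the expected mutual information over all partitions (the M_all expectation) with the expectation over partitions of a single fixed shape (the restriction underlying M_perm); the particular ground truth and shape are assumptions made for illustration:

    # Enumerating the two random models for n = 5 nodes: the expectation of
    # MI(C', T) differs under a shape-restricted model (the restriction behind
    # M_perm) and under M_all (all partitions of the node set).
    from math import log

    def set_partitions(elements):
        """Yield every partition of `elements` as a tuple of frozensets."""
        if not elements:
            yield ()
            return
        first, rest = elements[0], elements[1:]
        for smaller in set_partitions(rest):
            yield (frozenset([first]),) + smaller
            for i, block in enumerate(smaller):
                yield smaller[:i] + (block | {first},) + smaller[i + 1:]

    def mutual_information(p, q, n):
        """Plain MI (in nats) between two partitions of the same n nodes."""
        mi = 0.0
        for a in p:
            for b in q:
                joint = len(a & b) / n
                if joint > 0:
                    mi += joint * log(joint / ((len(a) / n) * (len(b) / n)))
        return mi

    def shape(p):
        return tuple(sorted(len(block) for block in p))

    nodes = list(range(5))
    universe = list(set_partitions(nodes))          # B_5 = 52 partitions
    T = (frozenset({0, 1, 2}), frozenset({3, 4}))   # illustrative ground truth

    # M_all: uniform over every partition of the node set.
    e_all = sum(mutual_information(c, T, 5) for c in universe) / len(universe)

    # Shape-restricted: uniform over partitions with one fixed shape, here (1, 2, 2).
    restricted = [c for c in universe if shape(c) == (1, 2, 2)]
    e_restricted = sum(mutual_information(c, T, 5) for c in restricted) / len(restricted)

    print("E[MI] under M_all            :", e_all)
    print("E[MI] under the fixed shape  :", e_restricted)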

5 An Exact No Free Lunch Theorem

We strengthen the No Free Lunch theorem for community detection given by Peel et al. (2017) by using an improved loss function for community detection, AMI_all. Our proof does not distinguish the boundary partitions (the two trivial partitions) from the interior partitions (the remainder). It is entirely agnostic toward the particular ground truth T, which is exactly what we need. We improve the previous result by moving from the interior of the universe (which excludes the boundary partitions) to all of 𝒫.

5.1 Generalizer-independence of AMI_all

Lemma 2.

AMI_all is a generalizer-independent loss function over the entire space 𝒫 of partitions of n objects.

Proof.

Like Peel et al. (2017), we must show that the sum of scores is independent of the ground truth T:

\sum_{C \in \mathcal{P}} \operatorname{AMI}_{\mathrm{all}}(C, T) \;=\; \Lambda \quad \text{for all } T \in \mathcal{P} \qquad (4)

where 𝒫 is the space of all partitions of n objects. Unlike Peel et al. (2017), we take the AMI expectation over all clusterings in 𝒫, using the random model M_all (Gates and Ahn, 2017).

To prove our claim about Equation 4, we note that the denominator of AMI_all is a constant with respect to C (Equation 3), so we can factor it out of the sum and restrict our attention to the numerator. This is because the max-term in the denominator is constant for a given ground truth (Gates and Ahn, 2017) and the expectation term for a given T is independent of the particular C. Having factored this out, we will now prove Equation 4 by the stronger claim:

\sum_{C \in \mathcal{P}} \Bigl( \operatorname{MI}(C, T) \;-\; \mathbb{E}_{C' \sim M_{\mathrm{all}}}\!\left[\operatorname{MI}(C', T)\right] \Bigr) \;=\; 0 \qquad (5)

To prove Equation 5, we separate the summation’s two terms:

\sum_{C \in \mathcal{P}} \operatorname{MI}(C, T) \;-\; \sum_{C \in \mathcal{P}} \mathbb{E}_{C' \sim M_{\mathrm{all}}}\!\left[\operatorname{MI}(C', T)\right] \qquad (6)

The expectation is uniform over the universe 𝒫. (Why do we assume uniformity over 𝒫? Because this is the highest-entropy, i.e., least informed, distribution—it places the fewest assumptions on the distribution.) We can therefore apply the law of the unconscious statistician, then push the constant probability 1/|𝒫| out, to get

\mathbb{E}_{C' \sim M_{\mathrm{all}}}\!\left[\operatorname{MI}(C', T)\right] \;=\; \sum_{C' \in \mathcal{P}} \frac{1}{|\mathcal{P}|}\operatorname{MI}(C', T) \;=\; \frac{1}{|\mathcal{P}|} \sum_{C' \in \mathcal{P}} \operatorname{MI}(C', T) \qquad (7)

Because the inner sum is independent of any particular C, the outer sum is a sum of constants—one for each element in 𝒫. We can now express Equation 5 as follows, where the reciprocals straightforwardly cancel out:

\sum_{C \in \mathcal{P}} \operatorname{MI}(C, T) \;-\; |\mathcal{P}| \cdot \frac{1}{|\mathcal{P}|} \sum_{C' \in \mathcal{P}} \operatorname{MI}(C', T) \;=\; 0 \qquad (8)

This equivalence implies that Equation 4 is true. ∎
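The stronger claim (Equation 5) can also be checked numerically by brute force; the sketch below verifies it for every ground truth on four nodes, the two boundary partitions included (a sanity check of ours, not part of the formal proof):

    # Brute-force check of Equation 5 for n = 4 nodes: summed over every
    # candidate clustering C, the M_all-adjusted numerator
    # MI(C, T) - E_{C' ~ M_all}[MI(C', T)] is zero for every ground truth T,
    # including the two boundary partitions.
    from math import log

    def set_partitions(elements):
        """Yield every partition of `elements` as a tuple of frozensets."""
        if not elements:
            yield ()
            return
        first, rest = elements[0], elements[1:]
        for smaller in set_partitions(rest):
            yield (frozenset([first]),) + smaller
            for i, block in enumerate(smaller):
                yield smaller[:i] + (block | {first},) + smaller[i + 1:]

    def mutual_information(p, q, n):
        """Plain MI (in nats) between two partitions of the same n nodes."""
        mi = 0.0
        for a in p:
            for b in q:
                joint = len(a & b) / n
                if joint > 0:
                    mi += joint * log(joint / ((len(a) / n) * (len(b) / n)))
        return mi

    n = 4
    universe = list(set_partitions(list(range(n))))   # B_4 = 15 partitions

    for truth in universe:
        expected = sum(mutual_information(c, truth, n) for c in universe) / len(universe)
        total = sum(mutual_information(c, truth, n) - expected for c in universe)
        assert abs(total) < 1e-9   # holds for every T, boundary partitions included
    print("Equation 5 verified for all", len(universe), "ground truths")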

The proof is valid without loss of generality vis-à-vis the distribution—that is, as long as the AMI expectation is computed uniformly over the problem universe 𝒫, AMI is a generalizer-independent measure. This stipulation is relevant to tasks which assume a fixed number of clusters—using the uniform model over that restricted universe—like k-clustering and graph partitioning.

Having demonstrated the generalizer-independence of AMI, we can define our loss function as, say,

\mathcal{L}(C, T) \;=\; 1 - \operatorname{AMI}_{\mathrm{all}}(C, T) \qquad (9)

The loss is zero when we exactly match the true clustering and positive otherwise.

Having proven the generalizer-independence of AMI_all, we now turn to a more general form of the No Free Lunch theorem, which admits not just a homogeneous loss function but any generalizer-independent loss.

Theorem 3 (Wolpert, 1996).

For a generalizer-independent loss ℒ, the uniform average over all problem distributions of the generalization error equals a loss-specific constant Λ_ℒ. (Plainly, there is no free lunch.)

Proof.

See Wolpert (1996). ∎

Theorem 4 (No Free Lunch theorem for community detection and other set-partitioning tasks).

For a set-partitioning problem with a loss function of adjusted mutual information using the appropriate random model for the task, the uniform average over all problem distributions of the generalization error equals a loss-specific constant.

Proof.

Lemma 2 proves that AMI using the appropriate random model is generalizer-independent. Applying Theorem 3 completes the proof (Peel et al., 2017). ∎

5.2 Other measures

AMI stemmed from a series of efforts to improve normalized mutual information (NMI). We note that six other measures, when extended to M_all instead of M_perm, are also generalizer-independent: the adjusted Rand index (ARI; Hubert and Arabie, 1985), relative NMI (rNMI; Zhang, 2015), ratio of relative NMI (rrNMI; Zhang et al., 2015), Cohen's κ (Liu et al., 2018), corrected NMI (cNMI; Lai and Nardini, 2016), and standardized mutual information (SMI; Romano et al., 2014). We elide the proofs because they are similar to Lemma 2. Each of the six measures satisfies the precondition for the No Free Lunch theorem when the random model matches the problem domain.

Of late, a renewed push has advocated using the adjusted Rand index (ARI; Hubert and Arabie, 1985) to evaluate community detection; in fact, ARI and AMI are specializations of the same underlying function which uses generalized information-theoretic measures (Romano et al., 2016). Every claim in the proof works for ARI, by replacing every mutual information term with the corresponding Rand index term.

Another line of research, focusing on improving NMI, produced rNMI (Zhang, 2015), rrNMI (Zhang et al., 2015), and cNMI (Lai and Nardini, 2016). We note that rrNMI is identical to one-sided AMI when both are extended to M_all. Consequently, our claim above works just as well for rrNMI. Further, because we were able to ignore the denominator of AMI in our proof of Lemma 2, we can do the same for rrNMI, which gives its unnormalized variant, rNMI. This means that rNMI is a generalizer-independent measure as well, when used in the appropriate one-sided random model. The practical benefit of normalizing rNMI into rrNMI is that the normalized measure gives a more interpretable notion of success.

Additionally, Lemma 2 holds true for standardized mutual information (which is equivalent to standardized variation of information and standardized V-measure) (Romano et al., 2014), the adjusted variation of information (Vinh et al., 2009), and for Cohen's κ, advocated for CD by Liu et al. (2018). This is because each measure shares the form of AMI: an observed score minus an expectation.

Finally, to show whether cNMI is generalizer-independent under the correct random model, we must show how to specialize it into a one-sided variant, because there is room for interpretation about how this should be done, even restricting our focus to M_all. The expression for cNMI

(10)

depends on both C and T relative to the universes that contain them. Our specialization should remove the dependence on the family of the ground truth T, so we arrive at the following expression after cancellation and noting that the NMI between a clustering and itself is 1:

(11)

As it turns out, this quasi-adjusted measure is also generalizer-independent.

In general, we now have a recipe for generalizer-independent loss functions: They can be created by subtracting the expected score from the observed score. This recipe works whenever a uniform expectation can be well defined.
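As a sketch of that recipe (the helper and its names are hypothetical, not drawn from any particular package):

    # The general recipe from this section, as a sketch: adjust any partition
    # similarity score for chance by subtracting its uniform expectation over
    # the problem universe, holding the ground truth fixed.
    from typing import Callable, Sequence, TypeVar

    Partition = TypeVar("Partition")

    def adjusted_for_chance(score: Callable[[Partition, Partition], float],
                            candidate: Partition,
                            truth: Partition,
                            universe: Sequence[Partition]) -> float:
        """Observed score minus its uniform expectation over `universe`."""
        expected = sum(score(c, truth) for c in universe) / len(universe)
        return score(candidate, truth) - expected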

6 Conclusion

We now have a proof of the No Free Lunch theorem for community detection and clustering that is both complete and exact. We show that a corrected form of AMI, namely AMI_all, computes its expectation in a way that does not advantage the boundary partitions (one cluster, or all-singleton clusters). Indeed, this expectation is over the entire universe of partitions 𝒫, rather than any proper subset, such as the historically common M_perm. We affirm the claim: Any subset of problems for which an algorithm outperforms others is balanced by another subset for which the algorithm underperforms others. Thus, there is no single community detection algorithm that is best overall (Peel et al., 2017).

It is still possible for an algorithm to perform better on a subset of community detection problems, so we can strive toward improved results on such a subset. To hope to perform well, we must note the assumptions about the subset of problems we expect to encounter. Some work has been done on estimating network properties to select the correct algorithm for the task at hand—a coarse way of checking assumptions (Peel, 2011; Yang et al., 2016). Beyond this, though, we must clarify what the problem of community detection is; the formulation we choose will guide which subset of problem instances to prioritize and which to sacrifice.

Acknowledgments

The authors thank, alphabetically by surname, Daniel Larremore, Leto Peel, David Wolpert, Patrick Xia, and Jean-Gabriel Young for discussions that improved the work. Any mistakes are the authors’ alone.

References

  • Chen et al. (2019) Zhengdao Chen, Lisha Li, and Joan Bruna. 2019. Supervised community detection with line graph neural networks. In International Conference on Learning Representations.
  • Gates and Ahn (2017) Alexander J. Gates and Yong-Yeol Ahn. 2017. The impact of random models on clustering similarity. Journal of Machine Learning Research, 18(87):1–28.
  • Hauer and Kondrak (2016) Bradley Hauer and Grzegorz Kondrak. 2016. Decoding anagrammed texts written in an unknown language and script. Transactions of the Association for Computational Linguistics, 4:75–86.
  • Hubert and Arabie (1985) Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of classification, 2(1):193–218.
  • Kvalseth (1987) T. O. Kvalseth. 1987. Entropy and correlation: Some comments. IEEE Transactions on Systems, Man, and Cybernetics, 17(3):517–519.
  • Lai and Nardini (2016) Darong Lai and Christine Nardini. 2016. A corrected normalized mutual information for performance evaluation of community detection. Journal of Statistical Mechanics: Theory and Experiment, 2016(9):093403.
  • Liu et al. (2018) X. Liu, H.-M. Cheng, and Z.-Y. Zhang. 2018. Evaluation of Community Structures using Kappa Index and F-Score instead of Normalized Mutual Information. ArXiv e-prints.
  • McCarthy et al. (2019) Arya D. McCarthy, Tongfei Chen, Rachel Rudinger, and David W. Matula. 2019. Metrics matter in community detection. CoRR, abs/1901.01354.
  • McCarthy and Matula (2018) Arya D. McCarthy and David W. Matula. 2018. Normalized mutual information exaggerates community detection performance. In SIAM Workshop on Network Science 2018, SIAM NS18, pages 78–79, Portland, OR, USA. SIAM.
  • Newman and Girvan (2004) M. E. J. Newman and M. Girvan. 2004. Finding and evaluating community structure in networks. Phys. Rev. E, 69:026113.
  • Peel (2011) Leto Peel. 2011. Estimating network parameters for selecting community detection algorithms. Journal of Advances of Information Fusion, 6:119–130.
  • Peel et al. (2017) Leto Peel, Daniel B. Larremore, and Aaron Clauset. 2017. The ground truth about metadata and community detection in networks. Science Advances, 3(5).
  • Radicchi et al. (2004) Filippo Radicchi, Claudio Castellano, Federico Cecconi, Vittorio Loreto, and Domenico Parisi. 2004. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences, 101(9):2658–2663.
  • Romano et al. (2014) Simone Romano, James Bailey, Vinh Nguyen, and Karin Verspoor. 2014. Standardized mutual information for clustering comparisons: One step further in adjustment for chance. In International Conference on Machine Learning, pages 1143–1151.
  • Romano et al. (2016) Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. 2016. Adjusting for chance clustering comparison measures. The Journal of Machine Learning Research, 17(1):4635–4666.
  • Schumacher et al. (2001) C. Schumacher, M. D. Vose, and L. D. Whitley. 2001. The no free lunch and problem description length. In Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation, GECCO'01, pages 565–570, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Vinh et al. (2009) Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1073–1080, New York, NY, USA. ACM.
  • Wolpert (1996) David H. Wolpert. 1996. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390.
  • Yang et al. (2016) Zhao Yang, René Algesheimer, and Claudio J. Tessone. 2016. A comparative analysis of community detection algorithms on artificial networks. Scientific Reports, 6:30750.
  • Zhang et al. (2015) Junhao Zhang, Tongfei Chen, and Junfeng Hu. 2015. On the relationship between gaussian stochastic blockmodels and label propagation algorithms. Journal of Statistical Mechanics: Theory and Experiment, 2015(3):P03009.
  • Zhang (2015) Pan Zhang. 2015. Evaluating accuracy of community detection using the relative normalized mutual information. Journal of Statistical Mechanics: Theory and Experiment, 2015(11):P11006.