A Map Equation with Metadata: Varying the Role of Attributes in Community Detection

10/24/2018 ∙ by Scott Emmons, et al. ∙ University of North Carolina at Chapel Hill

As the No Free Lunch theorem formally states [1], algorithms for detecting communities in networks must make tradeoffs. In this work, we present a method for using metadata to inform tradeoff decisions. We extend the content map equation, which adds metadata entropy to the traditional map equation, by introducing a tuning parameter allowing for explicit specification of the metadata's relative importance in assigning community labels. On synthetic networks, we show how tuning for node metadata relates to the detectability limit, and on empirical networks, we show how increased tuning for node metadata yields increased mutual information with the metadata at a cost in the traditional map equation. Our tuning parameter, like the focusing knob of a microscope, allows users to "zoom in" and "zoom out" on communities with varying levels of focus on the metadata.

I Introduction

As network science has found application in a variety of real-world systems, ranging from the biological to the technological, so too has community detection in networks received widespread attention and application Porter et al. (2009); Fortunato (2010); Fortunato and Hric (2016); Shai et al. (2017). Traditionally, community detection methods have focused solely on the topology of the network, optimizing an objective function defined on the network structure that captures a particular notion of community, such as intra-community edge density and inter-community edge sparsity. Many approaches, ranging from the statistical to the information theoretical, have been used for community detection, and tradeoffs between these approaches include describing extant links versus predicting missing links.

Since the No Free Lunch theorem mandates that community detection algorithms make tradeoffs Peel et al. (2017), algorithms can use node metadata to inform tradeoff decisions. For example, Newman and Clauset demonstrated that their stochastic block model (SBM) approach can choose either to partition a middle school and high school social network into communities by grade or into communities by race, depending on the metadata of interest Newman and Clauset (2016). Similarly, Hric et al. (2016) developed an attributed SBM from a multilayer perspective, with the attribute layer modeling relational information between attributes. Stanley et al. (2018a) considered a different graphical model relating connections and attributes, with assumptions on the attribute distributions, to develop a stochastic block model with multiple continuous attributes. The I-Louvain method Combe et al. (2015) extends the well-known Louvain algorithm Blondel et al. (2008) for modularity maximization by including attributes in its “inertia-based modularity.” Yang et al. proposed CESNA Yang et al. (2013) to learn a set of propensities or affiliations for each node across all possible communities such that two nodes with similar community propensities have common connectivity and attributes. Peel et al. (2017) established a statistical test to determine if attributes correlate with community structure, and they developed an SBM with flexibility in how strongly to couple attributes and community labels in the corresponding stochastic block model inference. In related work, Stanley et al. (2018b) propose a test statistic based on label propagation for the alignment of node attributes with connectivity patterns.

Most closely related to our approach is the content map equation proposed by Smith et al. (2016). The content map equation, as we will later describe in more detail, adds a term to the map equation Rosvall and Bergstrom (2008) that introduces entropy based on the metadata. This modification to the map equation encourages intra-module homogeneity of node metadata values. Our paper extends the content map equation, categorizing the different sources of entropy in the map equation into the “inter-module codebook,” “intra-module codebooks,” and “metadata codebooks.” Within this framework, we introduce a tuning parameter to the metadata codebooks that can be interpreted as a relative cost of accessing a Shannon channel. Similar to how focusing knobs are an essential feature of a microscope, adding a tuning parameter to the content map equation is essential to its function, allowing one to “zoom in” and “zoom out” on communities with varying levels of focus on the metadata.

II Methodology

The map equation frames the problem of community detection as minimizing the description length of a random walk on the network Rosvall and Bergstrom (2008). In developing a code to compress the description of the random walk, the map equation necessitates that each codeword corresponds to an identifiable entity in the graph. It designates codewords for hard partitions of nodes into modules, codewords for individual nodes, and codewords for a special “exit” keyword for each module. As the codeword for a given node needs only to be unique within that node’s module, the module names and node names function like geographic city names and street names. The output of the map equation is a sort of “map,” optimized for data compression, that captures patterns in the data.

The map equation’s entropy arises from two different types of codebooks. The “inter-module codebook,” consisting of module names, describes movement between modules. The “intra-module codebooks,” consisting of node names and special “exit” codewords, describe movement within modules. The sum of the entropies of these codebooks, weighted by their relative frequencies, gives the per-step average number of bits needed to describe an infinite random walk on the network for a given partition $\mathsf{M}$ of the nodes into $m$ modules:

$$L(\mathsf{M}) = q_{\curvearrowright} H(\mathcal{Q}) + \sum_{i=1}^{m} p_{\circlearrowright}^{i} H(\mathcal{P}^{i}).$$

Here, $H(\mathcal{Q})$ is the entropy of the inter-module codebook, used with relative frequency $q_{\curvearrowright}$, and $H(\mathcal{P}^{i})$ is the entropy of the intra-module codebook for module $i$, used with relative frequency $p_{\circlearrowright}^{i}$.
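To make the two codebook terms concrete, the following is a minimal sketch that evaluates the traditional map equation for a fixed partition. It makes a simplifying assumption that is not part of the text: the graph is undirected and unweighted and there is no teleportation, so that a node's visit rate is its degree divided by twice the number of edges. The function names `entropy_bits` and `map_equation` are hypothetical.

```python
import math
from collections import defaultdict


def entropy_bits(weights):
    """Shannon entropy (bits) of the distribution obtained by normalizing weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def map_equation(edges, module_of):
    """Per-step description length for an undirected, unweighted graph.

    Assumption: no teleportation, so visit rate = degree / (2 * number of edges).
    """
    two_m = 2 * len(edges)
    degree = defaultdict(int)
    exit_rate = defaultdict(float)  # module -> rate at which the walker leaves it
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            # A cut edge is crossed at rate 1 / 2m in each direction.
            exit_rate[module_of[u]] += 1 / two_m
            exit_rate[module_of[v]] += 1 / two_m
    node_rates = defaultdict(list)  # module -> visit rates of its nodes
    for node, d in degree.items():
        node_rates[module_of[node]].append(d / two_m)

    # Inter-module codebook: module names, used at the total exit rate.
    q = sum(exit_rate.values())
    length = q * entropy_bits(list(exit_rate.values()))
    # Intra-module codebooks: node names plus one exit codeword per module.
    for module, rates in node_rates.items():
        q_i = exit_rate.get(module, 0.0)
        length += (q_i + sum(rates)) * entropy_bits(rates + [q_i])
    return length


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "all" for node in range(6)}
print(map_equation(edges, two_modules))
print(map_equation(edges, one_module))
```

On this toy graph the two-triangle partition yields a shorter description than placing all nodes in one module, which is the sense in which the map equation rewards partitions that trap the walker.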

The traditional map equation is concerned solely with topology; only the path of the random walker must be encoded. To extend the map equation to networks annotated with metadata, we additionally require that the value of the metadata at each step of the random walk be encoded. The game of the encoder is to encode at which node a random walker is at each step of the random walk. Like in the traditional map equation, the encoder must report whenever the random walker changes modules. Additionally, the encoder must report the value of the metadata at each step of the random walk. We require that the metadata values be encoded uniquely within each module, a requirement that, as we will later see, favors network partitions in which module labels align with metadata labels.

To model network dynamics, we consider a random surfer on the network. With probability $1 - \tau$, the surfer behaves like a random walker, choosing to walk along an outgoing edge of its current node with probability proportional to the outgoing edge weights. With probability $\tau$, the surfer teleports to a node selected uniformly at random in the network. Considering this walk in the limit of an infinite number of steps, we arrive at a steady-state distribution $p_\alpha$ for every node $\alpha$ in the network. For notational convenience, we normalize the outgoing edge weights $w_{\alpha\beta}$ of every node $\alpha$ so that $\sum_{\beta} w_{\alpha\beta} = 1$. We let $\mathcal{X}$ denote a finite, discrete set of all metadata labels and assume that each node $\alpha$ is tagged with exactly one $x_\alpha \in \mathcal{X}$.
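The steady-state distribution of such a random surfer can be approximated by power iteration. The sketch below is illustrative only: the function name `steady_state`, the default teleportation probability of 0.15, and the treatment of nodes without outgoing edges (they always teleport) are assumptions, not choices stated in the text.

```python
def steady_state(out_weights, tau=0.15, tol=1e-12, max_iter=10_000):
    """Approximate the random surfer's steady-state visit rates.

    out_weights maps every node to a dict {neighbor: weight}; every neighbor
    must itself be a key. tau is the teleportation probability (assumed value).
    """
    nodes = list(out_weights)
    n = len(nodes)
    p = {a: 1.0 / n for a in nodes}
    for _ in range(max_iter):
        nxt = {a: 0.0 for a in nodes}
        teleport_mass = 0.0
        for a in nodes:
            total = sum(out_weights[a].values())
            if total == 0:
                # Assumption: a node with no outgoing edges always teleports.
                teleport_mass += p[a]
                continue
            teleport_mass += tau * p[a]
            for b, w in out_weights[a].items():
                # Follow an outgoing edge with probability proportional to its weight.
                nxt[b] += (1 - tau) * p[a] * w / total
        for b in nodes:
            # Teleportation lands on a uniformly random node.
            nxt[b] += teleport_mass / n
        change = sum(abs(nxt[a] - p[a]) for a in nodes)
        p = nxt
        if change < tol:
            break
    return p


graph = {"a": {"b": 1.0}, "b": {"c": 1.0}, "c": {"a": 1.0, "b": 1.0}}
rates = steady_state(graph)
print(rates)
```

Dividing by `total` inside the loop plays the role of the edge-weight normalization described above, so the input weights do not need to be normalized in advance.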

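The content map equation models entropy generated by the random surfer’s movements between modules and within modules identically to the traditional map equation. Between modules, we encode whenever the random surfer exits one module and enters another module. The chance at any given step that the surfer exits module $i$ is

$$q_{i\curvearrowright} = \tau \frac{n - n_i}{n} \sum_{\alpha \in i} p_\alpha + (1 - \tau) \sum_{\alpha \in i} \sum_{\beta \notin i} p_\alpha w_{\alpha\beta},$$

where $n$ is the total number of nodes and $n_i$ is the number of nodes in community $i$. We denote the total chance at any given step that the random surfer exits a module as

$$q_{\curvearrowright} = \sum_{i=1}^{m} q_{i\curvearrowright}.$$

By Shannon’s source coding theorem, the minimum entropy to encode the transitions between modules, the encoding we call the “inter-module codebook,” is

$$H(\mathcal{Q}) = -\sum_{i=1}^{m} \frac{q_{i\curvearrowright}}{q_{\curvearrowright}} \log_2 \frac{q_{i\curvearrowright}}{q_{\curvearrowright}}.$$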

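The random surfer’s movement within modules is another source of entropy in the map equation. Within each module $i$, we encode the name of each node $\alpha$ that the random surfer visits with steady-state frequency $p_\alpha$, and we use a special “exit” keyword occurring with frequency $q_{i\curvearrowright}$ to encode when the random surfer exits the module. Together, these terms give the total weight of the intra-module codebook for module $i$,

$$p_{\circlearrowright}^{i} = q_{i\curvearrowright} + \sum_{\alpha \in i} p_\alpha.$$

By Shannon’s source coding theorem, the minimum entropy to encode the transitions within a module, an encoding we call an “intra-module codebook,” is

$$H(\mathcal{P}^{i}) = -\frac{q_{i\curvearrowright}}{p_{\circlearrowright}^{i}} \log_2 \frac{q_{i\curvearrowright}}{p_{\circlearrowright}^{i}} - \sum_{\alpha \in i} \frac{p_\alpha}{p_{\circlearrowright}^{i}} \log_2 \frac{p_\alpha}{p_{\circlearrowright}^{i}}.$$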

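The content map equation additionally models the entropy of the node metadata values at each step of the random surf. Within each module $i$, we assign a codeword to each metadata value $x \in \mathcal{X}$ that occurs with frequency

$$p_{x}^{i} = \sum_{\alpha \in i \,:\, x_\alpha = x} p_\alpha,$$

and we let the total metadata weight of module $i$ be

$$p_{\mathcal{X}}^{i} = \sum_{x \in \mathcal{X}} p_{x}^{i} = \sum_{\alpha \in i} p_\alpha.$$

By Shannon’s source coding theorem, the minimum entropy to encode the metadata values within module $i$, in that module’s “metadata codebook,” is

$$H(\mathcal{X}^{i}) = -\sum_{x \in \mathcal{X}} \frac{p_{x}^{i}}{p_{\mathcal{X}}^{i}} \log_2 \frac{p_{x}^{i}}{p_{\mathcal{X}}^{i}}.$$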

By encoding the metadata values separately within each module, we reward partitions whose module labels align with the metadata values. Under this encoding method, if all nodes in a module have the same metadata value, the module name in the inter-module codebook alone thereby fully specifies the metadata value at each within-module movement step, and the metadata codebook contributes zero additional entropy for this module.

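Summing the entropies of the inter-module codebook, the intra-module codebooks, and the metadata codebooks, weighted by their frequency of use, the corresponding content map equation for a given partition $\mathsf{M}$ becomes

$$L(\mathsf{M}) = q_{\curvearrowright} H(\mathcal{Q}) + \sum_{i=1}^{m} p_{\circlearrowright}^{i} H(\mathcal{P}^{i}) + \eta \sum_{i=1}^{m} p_{\mathcal{X}}^{i} H(\mathcal{X}^{i}), \tag{1}$$

where $H(\mathcal{X}^{i})$ is the entropy of the metadata codebook of module $i$, used with relative frequency $p_{\mathcal{X}}^{i}$, and where we introduce the parameter $\eta$ to control the relative weight of the metadata entropy. By increasing $\eta$, we increasingly favor communities of nodes with shared metadata values. The special case $\eta = 1$ is identical to the method proposed by Smith et al. (2016). When each module contains only a single distinct metadata label, or when $\eta = 0$, the corresponding map equation reduces to the traditional map equation.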
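As an illustration of how the tuning parameter shifts the preferred partition, the sketch below scores two fixed partitions of a toy graph under the content map equation. It is a sketch under stated assumptions rather than the authors' implementation: the graph is undirected and unweighted with no teleportation (visit rate = degree divided by twice the number of edges), the tuning parameter is called `eta` in the code, and `content_map_equation` is a hypothetical name. No optimization over partitions is performed.

```python
import math
from collections import defaultdict


def entropy_bits(weights):
    """Shannon entropy (bits) of the distribution obtained by normalizing weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def content_map_equation(edges, module_of, metadata_of, eta=1.0):
    """Return (total, topological, metadata) description lengths in bits.

    Assumption: undirected, unweighted graph without teleportation.
    """
    two_m = 2 * len(edges)
    degree = defaultdict(int)
    exit_rate = defaultdict(float)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            exit_rate[module_of[u]] += 1 / two_m
            exit_rate[module_of[v]] += 1 / two_m
    node_rates = defaultdict(list)
    label_rates = defaultdict(lambda: defaultdict(float))
    for node, d in degree.items():
        rate = d / two_m
        node_rates[module_of[node]].append(rate)
        label_rates[module_of[node]][metadata_of[node]] += rate

    # Inter-module codebook term.
    topological = sum(exit_rate.values()) * entropy_bits(list(exit_rate.values()))
    metadata = 0.0
    for module, rates in node_rates.items():
        q_i = exit_rate.get(module, 0.0)
        # Intra-module codebook term for this module.
        topological += (q_i + sum(rates)) * entropy_bits(rates + [q_i])
        # Metadata codebook term: zero when the module has a single label.
        per_label = list(label_rates[module].values())
        metadata += sum(per_label) * entropy_bits(per_label)
    return topological + eta * metadata, topological, metadata


# Two triangles joined by a bridge; the metadata cuts across the first triangle.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
metadata = {0: "x", 1: "x", 2: "y", 3: "y", 4: "y", 5: "y"}
by_topology = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
by_metadata = dict(metadata)  # one module per metadata label
for eta in (0.0, 1.0, 2.0):
    l_top = content_map_equation(edges, by_topology, metadata, eta)[0]
    l_meta = content_map_equation(edges, by_metadata, metadata, eta)[0]
    print(eta, round(l_top, 3), round(l_meta, 3))
```

In this toy case the topology-aligned partition has the shorter description at `eta` values of 0 and 1, while the metadata-aligned partition becomes shorter at 2, because its metadata codebooks cost nothing and the other partition's metadata cost scales with `eta`.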

How can we interpret $\eta$? If we suppose that we send our message encoding the random surf over two different Shannon channels, one containing the information of the traditional map equation and the other containing the metadata information, then we can interpret $\eta$ as the relative cost to access the Shannon channel of metadata information. Alternatively, and mathematically equivalently, suppose that we send our message encoding the random surf over one channel. If we modify our encoding by sending a metadata codebook word at the frequency $f_1$ of each step of the random walk and sending inter-module codebook and intra-module codebook words at the frequency $f_2$ of each step of the random walk, then we can interpret $\eta$ as the relative frequency satisfying $\eta = f_1 / f_2$.

III Synthetic Graph Results

To analyze how varying impacts the content map equation’s ability to detect communities, we construct synthetic graphs according to a two-block planted-partition stochastic block model (SBM) with nodes evenly divided into communities, where an edge connecting two nodes in the same community exists with probability , and an edge connecting two nodes in different communities exists with probability . We additionally annotate each node with one of two discrete attribute labels, and the “alignment” parameter specifies the fraction of nodes for which the assignment using the attribute labels equates to the community assignment. Figure 1 shows results for different exploring the normalized mutual information (NMI) Strehl and Ghosh (2003); Vinh et al. (2010) between the planted partition and that identified by the content map equation, where each data point is the average of 100 trials with edge density . Throughout the paper, we use the scikit-learn implementation of normalized mutual information Pedregosa et al. (2011).
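A synthetic benchmark of the kind described above can be generated in a few lines. The following sketch uses only the Python standard library. The function name, the parameter names `p_in` and `p_out`, the example parameter values, and the way misaligned nodes are chosen (a uniformly random subset whose attribute label is flipped) are assumptions made for illustration; the text does not specify these details.

```python
import random


def planted_partition_with_metadata(n, p_in, p_out, alignment, seed=0):
    """Two-block planted-partition graph with one binary attribute per node.

    alignment is the fraction of nodes whose attribute label equals their
    planted community label (assumed construction: flip a random subset).
    """
    rng = random.Random(seed)
    # First half of the nodes in community 0, second half in community 1.
    community = [0 if node < n // 2 else 1 for node in range(n)]
    n_aligned = round(alignment * n)
    aligned = set(rng.sample(range(n), n_aligned))
    metadata = [community[v] if v in aligned else 1 - community[v] for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            # Within-community pairs connect with p_in, other pairs with p_out.
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community, metadata


edges, community, metadata = planted_partition_with_metadata(
    n=200, p_in=0.10, p_out=0.02, alignment=0.9, seed=1
)
matches = sum(c == x for c, x in zip(community, metadata))
print(len(edges), matches / len(community))
```

Passing the planted `community` labels and a detected partition to an NMI routine then gives one data point of the kind averaged over repeated trials in the experiments.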

For the corresponding unannotated SBM, it has been shown in the limit of infinite network size that the two-block planted-partition structure becomes undetectable for below the threshold given by . (For more detail, see Decelle et al. (2011); Nadakuditi and Newman (2012) and the discussion including non-sparse and also multilayer networks in Taylor et al. (2016).) Communities are only detectable when because otherwise the community structure itself is too weak relative to the background noise level in the generative model. For the parameters of our experiment, . Neither our content map equation implementation at $\eta = 0$ nor the official Infomap implementation achieves this theoretical detectability limit. This discrepancy may be in part due to finite-size corrections, as the theoretical limit is derived in terms of bulk eigenvalue distributions in the large-matrix limit. But it also likely depends at least partially on the specific algorithm employed, and we note in particular that the official Infomap implementation detects communities at smaller than our implementation at $\eta = 0$ in Fig. 1. Experimentally, we find that our content map equation implementation at $\eta = 0$ begins to detect communities in the interval of .

Figure 1: Detectability experiments on a planted-partition stochastic block model with nodes evenly divided into two communities. The “alignment” is the fraction of nodes with attribute label equal to planted partition community, and measures the assortativity of the planted partition. We see that focusing on the metadata by increasing enables the algorithm to overcome the detectability limit when the metadata is well-aligned with the community structure, but increasing ceilings the algorithm’s performance when the metadata is unaligned with the community structure.

As expected for , for community structure below the detectability limit, i.e., when , we find that increasing increases NMI because, although the communities are undetectable from the network connectivity alone, the metadata provides additional information. Moreover, as the metadata becomes more aligned with the communities, it provides a greater increase to the algorithm’s ability to detect the correct communities. For example, as Fig. 1 illustrates, increasing the alignment from 0.9 to 1.0 increases the average NMI at for the highest values of from around to .

Our experiments show both that increasing can benefit NMI when the community structure has relatively low community assortativity, i.e., when is small, and that increasing can hurt NMI when the communities have relatively high assortativity, i.e., when is large. When is small, increasing allows the algorithm to detect the signal present in the metadata, which is greater than that present in the network structure. But when is large, increasing too much causes the algorithm to overfit the metadata and miss the communities present in the network structure. This effect can be seen in Fig. 1 at alignment , where at high values of achieve average NMI over compared to average NMI of for low values of , while at high values of have average NMI capped below while low values of achieve average NMI approaching the perfect score, .

We note the perhaps surprising result that increasing increases NMI at low values of even when the metadata is unaligned with the community structure, when alignment = . This behavior results because NMI can be greater than when one of the partitions is trivial. For example, consider nodes evenly divided into two communities compared to a second partition where each node is its own singleton community, yielding NMI . Taking the unaligned results as a baseline, we see in the figure that the increase in NMI at low values of becomes larger for higher values of alignment between the planted community structure and the metadata.
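The claim that NMI can be positive against a trivial partition can be checked directly. The sketch below implements NMI from scratch, normalizing the mutual information by the arithmetic mean of the two partition entropies. That normalization is an assumption: scikit-learn's `normalized_mutual_info_score` exposes the choice through its `average_method` parameter, and the text does not say which setting was used. The example size of eight nodes is likewise chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Mutual information divided by the arithmetic mean of the entropies.

    Assumption: arithmetic-mean normalization; other choices give other values.
    """
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mi = sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    denom = (entropy(a) + entropy(b)) / 2
    return mi / denom if denom > 0 else 1.0


two_communities = [0, 0, 0, 0, 1, 1, 1, 1]  # eight nodes, evenly divided
singletons = list(range(8))                 # every node is its own community
print(nmi(two_communities, singletons))
```

Under this normalization the eight-node example gives 2 ln 2 / (ln 2 + ln 8) = 0.5: the singleton partition fully determines the two-community partition, so the mutual information equals the entropy of the two-community partition even though the singleton partition carries no structural insight.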

IV Real-World Graph Results

IV.1 Lazega Lawyers Networks of Law Firm Relationships

The Lazega lawyers networks consist of 71 lawyers at a corporate law firm in the American Northeast Lazega (2001). Surveys were conducted to form the basis of three networks connecting the same actors: the coworking network, based on a survey question asking each lawyer with whom in the firm the lawyer has worked; the advice network, based on a survey question asking each lawyer to whom in the firm the lawyer has gone for professional advice; and the social network, based on a survey question asking each lawyer with whom in the firm the lawyer socializes outside of work. As node metadata, we additionally use information that each lawyer reported about the lawyer’s status (partner or associate), gender (man or woman), office (Boston, Hartford, or Providence), practice (litigation or corporate), and law school (Harvard / Yale, University of Connecticut, or other).

Figures 2–5 illustrate how increasing $\eta$ affects the returned network partitions. The figures show communities in the coworking network (after removing an isolated node) using only the metadata attribute school. Node shapes encode metadata values while node colors encode the algorithm’s partition. At $\eta = 0$, the algorithm optimizes for the traditional map equation, returning a partition based solely on network topology. With increasing $\eta$, the algorithm returns modules more aligned with the metadata. For example, at , the algorithm returns the same partition as at . But at , the algorithm has identified a module of Harvard / Yale graduates, and at , the algorithm has identified a module where all but one are University of Connecticut graduates. Note, however, that even as the algorithm increasingly takes the metadata into consideration with increasing $\eta$, the algorithm still respects the topology of the network because the random surfer, which follows network links whenever it does not teleport, proceeds independently of node metadata values.

Figure 2: The Lazega lawyers coworking network partitioned with attribute school at . Color encodes the partition while shape encodes the metadata.
Figure 3: The Lazega lawyers coworking network partitioned with attribute school at . Color encodes the partition while shape encodes the metadata.
Figure 4: The Lazega lawyers coworking network partitioned with attribute school at . Color encodes the partition while shape encodes the metadata.
Figure 5: The Lazega lawyers coworking network partitioned with attribute school at . Color encodes the partition while shape encodes the metadata.

Figure 6 gives another view of the role of $\eta$ and the metadata in the community detection process. Each point on the graph of Fig. 6 is an NMI calculation. “Gender,” “office,” “practice,” “school,” and “status” are the partitions of the network given by the respective metadata labels, and “c_gender,” “c_office,” “c_practice,” “c_school,” and “c_status” are partitions returned by the algorithm given the corresponding metadata type as input and the value of $\eta$ indicated by the x-axis. The lines of Fig. 6 show how the NMIs of pairs of these partitions change with $\eta$. Pairs of partitions determined solely by metadata are constant with respect to $\eta$ because the metadata of each node is fixed. Pairs of community detection partitions considering different metadata begin with an NMI close to 1; when $\eta = 0$, the only difference in the returned partitions is due to stochasticity in optimization of the objective function. As $\eta$ increases, the algorithm returns partitions more aligned with the attribute under consideration, so the pairwise NMIs of these partitions decrease.

One can suppose that the optimal partition at $\eta = 0$ is a point in the space of all possible partitions of the graph. In this interpretation, increasing $\eta$ for a given metadata type causes the optimal partition to shift in partition space toward partitions more aligned with the particular metadata type. As the optimal partitions for different metadata types undergo such shifts, they diverge in partition space, and as Fig. 6 illustrates, their pairwise NMI decreases.

Figure 6: Pairwise NMIs of Lazega lawyers coworking network partitions. For example, “gender” is the partition given by each node’s gender, and “c_gender” is the algorithm’s returned partition when clustering with the gender metadata for a given value of $\eta$.

Figures 7–9 show the sum of the entropies of each codebook type, weighted by frequency of use but not relatively weighted by $\eta$, for partitions of the Lazega lawyers networks at varying $\eta$ for the different metadata types. “Inter-module entropy” measures the first term of Equation 1, “intra-module entropy” measures the second term of Equation 1, and “metadata entropy” measures the third term of Equation 1, unweighted by $\eta$. In all the plots, metadata entropy is at its maximum when $\eta = 0$, decreasing until the metadata entropy becomes $0$ for sufficiently large $\eta$.

As Figs. 7–9 illustrate, once the relative weight of the metadata codebook is sufficiently large, an optimal partition’s metadata codebook will necessarily have zero entropy. Optimizing the content map equation in the limit as $\eta \to \infty$ becomes a constrained optimization of the traditional map equation. A candidate optimal partition must have only one metadata attribute per module, and the optimal partition is the partition from this constrained region of partition space optimizing the traditional map equation.

The Lazega lawyers networks, which share the same set of attributed nodes but have different edge types, allow us to study empirically how different edge formation processes influence metadata community detection. In Fig. 7 and Fig. 8, the school panel shows how different connectivity patterns among the same set of attributed nodes can lead to qualitatively different behavior with increasing $\eta$. In the Fig. 7 coworker network, increasing $\eta$ results in a gradual transition toward partitions of zero metadata entropy. In the Fig. 8 advice network, on the other hand, there is a sharp transition point where the partitions switch from being optimal at $\eta = 0$ to being optimal as $\eta \to \infty$.

Figure 7: Sums of various types of entropy when clustering the Lazega lawyers coworking network. The sums are weighted by frequency of codebook use but not weighted by $\eta$.
Figure 8: Sums of various types of entropy when clustering the Lazega lawyers advice network. The sums are weighted by frequency of codebook use but not weighted by $\eta$.
Figure 9: Sums of various types of entropy when clustering the Lazega lawyers friendship network. The sums are weighted by frequency of codebook use but not weighted by $\eta$.

IV.2 Add Health Network of High School Friendship

The high school friendship network used here results from the US National Longitudinal Study of Adolescent Health and was provided by the Add Health project of the Carolina Population Center. Each of the 795 nodes of the graph is a student in an American middle school (7-8th grade, 12-14 years of age) and corresponding high school (9-12th grade, 14-18 years of age). Edges between nodes represent friendships determined by survey. As metadata for each node, we use the student survey data of grade (range 7-12), race (“white only”, “black only”, “any Hispanic”, “Asian only,” or “mixed / other”), school code (middle or high school), and sex (male or female).

The presence of various metadata types allows us to highlight a key feature of the algorithm, that it allows tuning to see how the network partitions under a particular metadata type of interest. In prior work on community detection with metadata, the method of Newman and Clauset Newman and Clauset (2016) was applied to the network three times, separately using grade, race, and gender, in each case partitioning the network into two communities. Using grade metadata, the algorithm splits the network into clear middle school and high school groups. Similarly, the algorithm divides the network into a predominantly white and a predominantly black group when it uses race metadata. However, when asked to use gender metadata, the algorithm of Newman and Clauset ignores the gender metadata because it does not have a strong enough correlation with the network structure. As Newman and Clauset note, for someone interested only in the metadata to the extent that it correlates with network structure, it is advantageous for the algorithm to disregard metadata that does not correlate.

But suppose that a priori a network analyst knows she cares about a particular metadata type. In our example, a social science researcher might be interested in how the high school friendship network organizes by gender, however strong or weak the gender partition might be. Using the algorithm of Newman and Clauset, there is no way for such a researcher to convey to the algorithm this preference for the gender metadata type. A key feature of our metadata map equation is the ability, using $\eta$, to specify the relative weight of a given metadata type compared to the network topology in assigning communities.

Figure 10 and Fig. 11 demonstrate how our algorithm can specify a relative weighting for various metadata types. When , all of the partitions follow only the network topology. In that case, our results, consistent with those of the algorithm of Newman and Clauset, show that the metadata attributes of grade, school code, and race have the highest mutual information with the topological partition, with respective NMI values of , , and , while the metadata attribute of sex has the least mutual information with the topological partition, with an NMI of no more than . When we increase to , we see using each of the metadata values (grade, race, school code, sex) that the algorithm finds partitions of the network that, compared to the community detection done with only the network topology at , has increased NMI with the metadata.

Importantly, the partitioning of the high school network with a relative metadata channel weight of does not simply ignore the network structure. Consistent with the results of Newman and Clauset, we see that grade is the metadata value for which we can achieve the highest NMI between the algorithm’s partition and the node metadata, with an NMI of , and we find that our algorithm’s partition using sex has the least correspondence with the node metadata, an NMI of .

Figure 10: Pairwise NMIs of High School Social Network partitions at . For example, “grade” is the partition given by each node’s grade, and “c_grade” is the algorithm’s returned partition when clustering with the grade metadata.
Figure 11: Pairwise NMIs of High School Social Network partitions at . For example, “grade” is the partition given by each node’s grade, and “c_grade” is the algorithm’s returned partition when clustering with the grade metadata.

As Fig. 12 illustrates, one can understand increasing $\eta$ as paying a topological entropy price (the sum of the inter-module codebook and intra-module codebook entropies, which is equal to the traditional map equation) for increased NMI with the metadata. The varying shapes of the curves in Fig. 12 show how the price of this tradeoff at a given value of $\eta$ depends on how node metadata values relate to the network structure. For example, consider the curves corresponding to the school code and sex metadata. For school code metadata, the optimal partition at $\eta = 0$ is already relatively close to meeting the constraint required as $\eta \to \infty$ that each module have just one metadata attribute. One cannot trade topological entropy for much increase in NMI with the metadata because increasing $\eta$ does not much change the returned partition. For sex metadata, however, the optimal partition at $\eta = 0$ is relatively far from having just one metadata attribute per module. By increasing $\eta$, one can pay topological entropy for increased NMI with the metadata as the returned partition shifts toward obeying the constraint imposed as $\eta \to \infty$.

Figure 12: Topological entropy and NMI tradeoff when clustering with metadata in the high school social network. Topological entropy, equal to the traditional map equation, is the sum of the inter-module codebook and intra-module codebook entropies. The NMI is between the node metadata and the algorithm’s returned partition.

V Conclusion

We introduced a tuning parameter to the content map equation that explicitly specifies the importance of metadata relative to edge connectivity in assigning communities. We demonstrated on synthetic graphs how focusing on the metadata can overcome the detectability limit when the metadata is well-aligned with the topological community structure and also how focusing on the metadata can put a ceiling on the performance when the metadata is misaligned with the topological community structure. On real-world graphs, we demonstrated how a practitioner might tune the content map equation to “zoom in” and “zoom out” on communities with varying levels of metadata focus.

Our method probes the relationship between community structure and metadata. While we gave the algorithm only one type of metadata attribute at a time, future work might simultaneously incorporate metadata attributes of differing type and relative weighting. It might also be interesting to study how various metadata types relate to various network processes as, for example, different metadata types might relate to the spread of different kinds of information.

Acknowledgements

We are grateful to Peter Diao and William Weir for helpful discussions. Research reported in this publication was supported by the James S. McDonnell Foundation 21st Century Science Initiative - Complex Systems Scholar Award grant #220020315. The content is solely the responsibility of the authors and does not represent the official views of the sponsor.

This research uses data from Add Health, a program project directed by Kathleen Mullan Harris and designed by J. Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris at the University of North Carolina at Chapel Hill, and funded by grant P01-HD31921 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development, with cooperative funding from 23 other federal agencies and foundations. Special acknowledgment is due Ronald R. Rindfuss and Barbara Entwisle for assistance in the original design. Information on how to obtain the Add Health data files is available on the Add Health website (http://www.cpc.unc.edu/addhealth). No direct support was received from grant P01-HD31921 for this analysis.

References