Hypergraph Motifs: Concepts, Algorithms, and Discoveries

03/04/2020 ∙ by Geon Lee, et al. ∙ KAIST Department of Mathematical Sciences

Hypergraphs naturally represent group interactions, which are omnipresent in many domains: collaborations of researchers, co-purchases of items, joint interactions of proteins, to name a few. In this work, we propose tools for answering the following questions in a systematic manner: (Q1) what are structural design principles of real-world hypergraphs? (Q2) how can we compare local structures of hypergraphs of different sizes? (Q3) how can we identify domains which hypergraphs are from? We first define hypergraph motifs (h-motifs), which describe the connectivity patterns of three connected hyperedges. Then, we define the significance of each h-motif in a hypergraph as its occurrences relative to those in properly randomized hypergraphs. Lastly, we define the characteristic profile (CP) as the vector of the normalized significance of every h-motif. Regarding Q1, we find that h-motifs' occurrences in 11 real-world hypergraphs from 5 domains are clearly distinguished from those of randomized hypergraphs. In addition, we demonstrate that CPs capture local structural patterns unique in each domain, and thus comparing CPs of hypergraphs addresses Q2 and Q3. Our algorithmic contribution is to propose MoCHy, a family of parallel algorithms for counting h-motifs' occurrences in a hypergraph. We theoretically analyze their speed and accuracy, and we show empirically that the advanced approximate version MoCHy-A+ is up to 25X more accurate and 32X faster than the basic approximate and exact versions, respectively.


1 Introduction

Complex systems consisting of pairwise interactions between individuals or objects are naturally expressed in the form of graphs. Nodes and edges, which compose a graph, represent individuals (or objects) and their pairwise interactions, respectively. Thanks to their powerful expressiveness, graphs have been used in a wide variety of fields, including social network analysis, web, bioinformatics, and epidemiology. Global structural patterns of real-world graphs, such as power-law degree distribution [8, 18] and six degrees of separation [28, 60], have been extensively investigated.


Figure 1: Example hyperedge motifs (or h-motifs in short) (below) and their instances (above).
Figure 2: Distributions of h-motifs’ instances precisely characterize local structural patterns of real-world hypergraphs. Note that the hypergraphs from the same domains have similar distributions, while the hypergraphs from different domains do not. See Section 5.3 for details.

In addition to global patterns, real-world graphs exhibit patterns in their local structures, which differentiate graphs in the same domain from random graphs or those in other domains. Local structures are revealed by counting the occurrences of different network motifs [41, 42], which describe the connectivity pattern of pairwise interactions between a fixed number of connected nodes (typically 3, 4, or 5 nodes). As a fundamental building block, network motifs have played a key role in many analytical and predictive tasks, including community detection [11, 39, 57, 62], classification [16, 34, 41], and anomaly detection [9, 52].

Despite the prevalence of graphs, interactions in many complex systems are groupwise rather than pairwise: collaborations of researchers, co-purchases of items, joint interactions of proteins, tags attached to the same web post, to name a few. These group interactions cannot be represented by edges in a graph. Suppose three or more researchers coauthor a publication. This co-authorship cannot be represented as a single edge, and creating edges between all pairs of the researchers cannot be distinguished from multiple papers coauthored by subsets of the researchers.

This inherent limitation of graphs is addressed by hypergraphs, which consist of nodes and hyperedges. Each hyperedge is a subset of any number of nodes, and it represents a group interaction among the nodes. For example, in a hypergraph, a paper coauthored by three researchers $A$, $B$, and $C$ is expressed as a hyperedge $\{A, B, C\}$, and it is distinguished from three papers coauthored by each pair, which are represented as three hyperedges $\{A, B\}$, $\{A, C\}$, and $\{B, C\}$.

(a) Input Hypergraph
(b) Projected Graph
Figure 3: A hypergraph and its projected graph. Hyperedges in the hypergraph act as nodes in the projected graph.

The successful investigation and discovery of local structural patterns in real-world graphs motivates us to explore local structural patterns in real-world hypergraphs. However, network motifs, which proved to be useful for graphs, are not trivially extended to hypergraphs. Specifically, due to the flexibility in the size of hyperedges, there can be infinitely many connectivity patterns of interactions among a fixed number of nodes, and other nodes can also be associated with these interactions.

In this work, taking these challenges into consideration, we define hypergraph motifs (h-motifs) so that they describe connectivity patterns of three connected hyperedges (rather than nodes). As seen in Figure 1, h-motifs describe the connectivity pattern of hyperedges $e_i$, $e_j$, and $e_k$ by the emptiness of seven subsets: $e_i \setminus e_j \setminus e_k$, $e_j \setminus e_k \setminus e_i$, $e_k \setminus e_i \setminus e_j$, $e_i \cap e_j \setminus e_k$, $e_j \cap e_k \setminus e_i$, $e_k \cap e_i \setminus e_j$, and $e_i \cap e_j \cap e_k$. As a result, every connectivity pattern is described by a unique h-motif, independently of the sizes of hyperedges. While this work focuses on connectivity patterns of three hyperedges, h-motifs are easily extended to four or more hyperedges.

We count the number of each h-motif’s instances in real-world hypergraphs from different domains. Then, we measure the significance of each h-motif in each hypergraph by comparing the count of its instances in the hypergraph against the counts in properly randomized hypergraphs. Lastly, we compute the characteristic profile (CP) of each hypergraph, defined as the vector of the normalized significance of every h-motif. Comparing the counts and CPs of different hypergraphs leads to the following observations:

  • Structural design principles of real-world hypergraphs that are captured by frequencies of different h-motifs are clearly distinguished from those of randomized hypergraphs.

  • Hypergraphs from the same domains have similar CPs, while hypergraphs from different domains have distinct CPs (see Figure 2). In other words, CPs successfully capture local structural patterns unique to each domain.

Our algorithmic contribution is to design MoCHy (Motif Counting in Hypergraphs), a family of parallel algorithms for counting h-motifs’ instances, which is the computational bottleneck of the above process. Note that since non-pairwise interactions (e.g., the intersection of three hyperedges) are taken into consideration, counting the instances of h-motifs is more challenging than counting the instances of network motifs, which are defined solely based on pairwise interactions. We provide one exact version, named MoCHy-E, and two approximate versions, named MoCHy-A and MoCHy-A+. Empirically, MoCHy-A+ is up to 25× more accurate than MoCHy-A, and it is up to 32× faster than MoCHy-E, with little sacrifice of accuracy. These empirical results are consistent with our theoretical analysis of their speed, bias, and variance.

In summary, our contributions are as follows:

  • Novel Concepts: We propose h-motifs, the counts of whose instances capture local structures of hypergraphs, independently of the sizes of hyperedges or hypergraphs.

  • Fast and Provable Algorithms: We develop MoCHy, a family of parallel algorithms for counting h-motifs’ instances. We show theoretically and empirically that the advanced version significantly outperforms the basic ones, providing a better trade-off between speed and accuracy.

  • Discoveries in Real-world Hypergraphs: We show that h-motifs and CPs reveal local structural patterns that are shared by hypergraphs from the same domains but distinguished from those of random hypergraphs and hypergraphs from other domains (see Figure 2).

In Section 2, we discuss related works. In Section 3, we introduce h-motifs and characteristic profiles. In Section 4, we present exact and approximate algorithms for counting instances of h-motifs, and we analyze their theoretical properties. After providing experimental results in Section 5, we offer conclusions in Section 6.

2 Related Work

In this section, we review previous work on network motifs, algorithms for network motif counting, and hypergraphs. While the definition of a network motif varies among studies, here we define it as a connected graph composed of a predefined number of nodes.

Network Motifs. Network motifs were proposed as a tool for understanding the underlying design principles and capturing the local structural patterns of graphs [19, 50, 42]. The occurrences of motifs in real-world graphs are significantly different from those in random graphs [42] and also vary depending on the domains of graphs [41]. The concept of network motifs has been extended to various types of graphs, including dynamic [45], bipartite [13], and heterogeneous [47] graphs. The occurrences of network motifs have been used in a wide range of graph applications: community detection [11, 62, 39, 57], ranking [67], graph embedding [48, 65], and graph neural networks [34], to name a few.

Algorithms for Network Motif Counting. Due to these wide applications, numerous algorithms have been proposed for rapid and accurate counting of the occurrences of motifs in large graphs. Some of them focus on counting the occurrences of a particular motif, such as the triangle (i.e., clique of three nodes) [2, 17, 20, 21, 26, 31, 33, 44, 51, 52, 56, 58, 59], the butterfly (i.e., $2 \times 2$ biclique) [49], and the clique of a given number of nodes [25]. Others are for counting the occurrences of every motif of a fixed size [3, 4, 7, 14, 46]. Many of these algorithms employ sampling to estimate the counts [2, 7, 14, 17, 26, 44, 49, 51, 52, 56]. Note that these previous approaches for counting the occurrences of network motifs are not directly applicable to the problem of counting the occurrences of h-motifs. This is because, different from network motifs, which are defined solely based on pairwise interactions, h-motifs are defined based on non-pairwise interactions (see Section 3.2).


Figure 4: The 26 h-motifs studied in this work. Note that h-motifs 17 - 22 are open, while the others are closed.
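Notation                          Definition
$G = (V, E)$                      hypergraph with nodes $V$ and hyperedges $E$
$E = \{e_1, \dots, e_{|E|}\}$     set of hyperedges
$E_v$                             set of hyperedges that contain a node $v$
$\wedge$                          set of hyperwedges in $G$
$\wedge_{ij}$                     hyperwedge consisting of $e_i$ and $e_j$
$\bar{G} = (E, \wedge, \omega)$   projected graph of $G$
$\omega(\wedge_{ij})$             the number of nodes shared between $e_i$ and $e_j$
$N_{e_i}$                         set of neighbors of $e_i$ in $\bar{G}$
$h(\{e_i, e_j, e_k\})$            h-motif corresponding to an instance $\{e_i, e_j, e_k\}$
$M[t]$                            count of h-motif $t$'s instances
Table 1: Frequently-used symbols.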

Hypergraphs. Hypergraphs naturally represent group interactions, and they have been identified as a useful tool in a wide range of fields, including computer vision [23, 22, 64], bioinformatics [24], circuit design [29, 43], social network analysis [61, 36], and recommender systems [15, 37]. There also has been considerable attention on machine learning on hypergraphs, including clustering [1, 6, 30, 38, 68], classification [27, 55, 64], and hyperedge prediction [10, 63, 66]. Recently, empirical studies on real-world hypergraphs have revealed several structural and temporal patterns [10, 12]. They focus on simplicial closure (i.e., the emergence of the first hyperedge that includes a set of nodes each of whose pairs co-appear in previous hyperedges) [10] and repetition of the same hyperedges and their subsets [12].

3 Proposed Concepts

In this section, we introduce the proposed concepts: hypergraph motifs and characteristic profiles. Refer to Table 1 for the notations frequently used throughout the paper.

3.1 Preliminaries and Notations

We define some preliminary concepts and their notations.

Hypergraph: Consider a hypergraph $G = (V, E)$, where $V$ and $E = \{e_1, \dots, e_{|E|}\}$ are sets of nodes and hyperedges, respectively. Each hyperedge $e_i \in E$ is a non-empty subset of $V$, and we use $|e_i|$ to denote the number of nodes in it. For each node $v \in V$, we use $E_v$ to denote the set of hyperedges that include $v$. We say two hyperedges $e_i$ and $e_j$ are adjacent if they share any member, i.e., if $e_i \cap e_j \neq \emptyset$. Then, for each hyperedge $e_i$, we denote the set of hyperedges adjacent to $e_i$ as $N_{e_i}$ and the number of such hyperedges as $|N_{e_i}|$. Similarly, we say three hyperedges $e_i$, $e_j$, and $e_k$ are connected if one of them is adjacent to the other two.

Hyperwedges: We define a hyperwedge as an unordered pair of adjacent hyperedges. We denote the set of hyperwedges in $G$ by $\wedge$. We use $\wedge_{ij}$ to denote the hyperwedge consisting of $e_i$ and $e_j$. In the example hypergraph in Figure 3(a), there are six hyperwedges.

Figure 5: The h-motifs (below) whose instances (above) contain duplicated hyperedges.

Projected Graph: We define the projected graph of $G = (V, E)$ by $\bar{G} = (E, \wedge, \omega)$, where $\wedge$ is the set of hyperwedges and $\omega(\wedge_{ij}) := |e_i \cap e_j|$. That is, in the projected graph $\bar{G}$, hyperedges in $G$ act as nodes, and two of them are adjacent if and only if they share any member. Note that for each hyperedge $e_i \in E$, $N_{e_i}$ is the set of neighbors of $e_i$ in $\bar{G}$, and $|N_{e_i}|$ is its degree in $\bar{G}$. Figure 3(b) shows the projected graph of the example hypergraph in Figure 3(a).

3.2 Hypergraph Motifs

We introduce hypergraph motifs, which are basic building blocks of hypergraphs, with related concepts. Then, we discuss their properties and generalization.

Definition and Representation: Hypergraph motifs (or h-motifs in short) describe the connectivity patterns of three connected hyperedges. Specifically, given a set $\{e_i, e_j, e_k\}$ of three connected hyperedges, h-motifs describe its connectivity pattern by the emptiness of the following seven sets: (1) $e_i \setminus e_j \setminus e_k$, (2) $e_j \setminus e_k \setminus e_i$, (3) $e_k \setminus e_i \setminus e_j$, (4) $e_i \cap e_j \setminus e_k$, (5) $e_j \cap e_k \setminus e_i$, (6) $e_k \cap e_i \setminus e_j$, and (7) $e_i \cap e_j \cap e_k$. Formally, an h-motif is defined as a binary vector of size $7$ whose elements represent the emptiness of the above sets, resp., and as seen in Figure 1, h-motifs are naturally represented in the Venn diagram. While there can be $2^7$ h-motifs, $26$ h-motifs remain once we exclude symmetric ones, those with duplicated hyperedges (see Figure 5), and those that cannot be obtained from connected hyperedges. The 26 cases, which we call h-motif 1 through h-motif 26, are visualized in the Venn diagram in Figure 4.

Instances, Open h-motifs, and Closed h-motifs: Consider a hypergraph $G = (V, E)$. A set of three connected hyperedges is an instance of h-motif $t$ if their connectivity pattern corresponds to h-motif $t$. The count of each h-motif’s instances is used to characterize the local structure of $G$, as discussed in the following sections. An h-motif is closed if all three hyperedges in its instances are adjacent to (i.e., overlapped with) each other. If its instances contain two non-adjacent (i.e., disjoint) hyperedges, an h-motif is open. In Figure 4, h-motifs 17 - 22 are open; the others are closed.
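As an illustration of the definitions above, the following Python sketch computes the seven-bit emptiness pattern of three hyperedges, canonicalizes it over the six orderings of the hyperedges, and enumerates the distinct patterns that survive the exclusions described earlier. The function names (`pattern`, `motif_key`, `is_instance`, `is_open`, `realise`, `enumerate_motifs`) are hypothetical, hyperedges are assumed to be Python sets of node ids, and the canonical key is a bit tuple rather than the 1 to 26 numbering of Figure 4, which cannot be recovered from the text alone.

```python
from itertools import permutations, product


def pattern(e_i, e_j, e_k):
    """7-bit vector: 1 if the corresponding region is non-empty, 0 if it is empty."""
    regions = (
        e_i - e_j - e_k, e_j - e_k - e_i, e_k - e_i - e_j,        # private regions
        (e_i & e_j) - e_k, (e_j & e_k) - e_i, (e_k & e_i) - e_j,  # pairwise-only regions
        e_i & e_j & e_k,                                          # common region
    )
    return tuple(int(bool(r)) for r in regions)


def motif_key(e_i, e_j, e_k):
    """Canonical pattern: identical for all six orderings of the same hyperedges."""
    return min(pattern(*order) for order in permutations((e_i, e_j, e_k)))


def is_instance(e_i, e_j, e_k):
    """Non-empty, pairwise distinct, and connected (one hyperedge adjacent to the other two)."""
    adjacent_pairs = sum(bool(x & y) for x, y in ((e_i, e_j), (e_j, e_k), (e_k, e_i)))
    distinct = e_i != e_j and e_j != e_k and e_k != e_i
    return all((e_i, e_j, e_k)) and distinct and adjacent_pairs >= 2


def is_open(key):
    """For a valid key: open if some pair of hyperedges is disjoint."""
    return key[6] == 0 and 0 in key[3:6]


# Which of the 7 regions belong to e_i, e_j, e_k; used to turn a bit pattern into sets.
REGIONS_OF = ((0, 3, 5, 6), (1, 3, 4, 6), (2, 4, 5, 6))


def realise(bits):
    """Build three concrete sets with one element per non-empty region."""
    return tuple(frozenset(r for r in regions if bits[r]) for regions in REGIONS_OF)


def enumerate_motifs():
    """Canonical keys of all patterns that come from three distinct connected hyperedges."""
    keys = set()
    for bits in product((0, 1), repeat=7):
        triple = realise(bits)
        if is_instance(*triple):
            keys.add(motif_key(*triple))
    return keys
```

Under these assumptions, `enumerate_motifs()` yields 26 canonical patterns, 6 of which are open, in line with the counts stated above.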

Properties of h-motifs: From the definition of h-motifs, the following desirable properties, which are discussed in Section 1, are immediate:

  • Exhaustive: h-motifs capture connectivity patterns of all possible three connected hyperedges.

  • Unique: the connectivity pattern of any three connected hyperedges is captured by exactly one h-motif.

  • Size Independent: h-motifs capture connectivity patterns independently of the sizes of hyperedges. Note that there can be infinitely many combinations of sizes of three connected hyperedges.

Generalization to Four or More Hyperedges: The concept of h-motifs is easily generalized to four or more hyperedges. For example, an h-motif for four hyperedges can be defined as a binary vector of size $15$ (i.e., $2^4 - 1$) indicating the emptiness of each region in the Venn diagram for four sets. We leave this generalization as future work and focus on the h-motifs for three hyperedges since they are already capable of characterizing local structures of real-world hypergraphs, as shown empirically in Section 5.

3.3 Characteristic Profile (CP)

What are the structural design principles of real-world hypergraphs distinguished from those of random hypergraphs? Below, we introduce the characteristic profile (CP), which is a tool for answering the above question using h-motifs.

Randomized Hypergraphs: While one might try to characterize the local structure of a hypergraph by the absolute counts of each h-motif’s instances in it, some h-motifs may naturally have many instances. Thus, for more accurate characterization, we need random hypergraphs to be compared against real-world hypergraphs. We obtain such random hypergraphs by randomizing a compared real-world hypergraph. To this end, we represent the hypergraph $G = (V, E)$ as a bipartite graph where $V$ and $E$ are the two partitions of nodes, and there exists an edge between $v \in V$ and $e \in E$ if and only if $v \in e$. Then, we use the Chung-Lu bipartite graph generative model, which successfully preserves the degree distribution [5]. As a result, we obtain randomized hypergraphs in which the degree (i.e., the number of hyperedges that each node belongs to) distribution of nodes and the size distribution of hyperedges in $G$ are maintained.
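The randomization step can be sketched as follows. This is a simplified stub-sampling variant of a bipartite Chung-Lu model and is only an assumption about how such a generator might look; it is not necessarily the procedure of [5]. The function name `chung_lu_randomize` is hypothetical. Each of the original node-hyperedge memberships is redrawn by picking a node with probability proportional to its degree and a hyperedge with probability proportional to its size, so degrees and sizes are preserved only in expectation.

```python
import random


def chung_lu_randomize(hyperedges, seed=None):
    """Return a randomized hypergraph as a list of node sets (simplified sketch)."""
    rng = random.Random(seed)
    # A node appears once per hyperedge it belongs to (sampling weight = degree).
    node_stubs = [v for e in hyperedges for v in e]
    # A hyperedge index appears once per member (sampling weight = size).
    edge_stubs = [i for i, e in enumerate(hyperedges) for _ in e]
    randomized = [set() for _ in hyperedges]
    for _ in range(len(node_stubs)):
        v = rng.choice(node_stubs)
        i = rng.choice(edge_stubs)
        randomized[i].add(v)  # repeated (node, hyperedge) draws collapse into one membership
    # Hyperedges that received no node are dropped.
    return [e for e in randomized if e]
```

Because repeated draws collapse and empty hyperedges are dropped, the output can have slightly fewer memberships than the input; a more careful generator would correct for this.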

Significance of H-motifs: We measure the significance of each h-motif in a hypergraph by comparing the count of its instances against the count of them in random hypergraphs. Specifically, the significance of an h-motif $t$ in a hypergraph $G$ is defined as

$$\Delta_t := \frac{M[t] - M_{rand}[t]}{M[t] + M_{rand}[t] + \epsilon}, \tag{1}$$

where $M[t]$ is the number of instances of h-motif $t$ in $G$, and $M_{rand}[t]$ is the average number of instances of h-motif $t$ in randomized hypergraphs obtained as described above. We fixed $\epsilon$ to $1$ throughout this paper. This way of measuring significance was proposed in [42] for network motifs as an alternative to normalized Z scores, which heavily depend on the graph size.

Characteristic Profile (CP): By normalizing and concatenating the significances of all h-motifs in a hypergraph, we obtain the characteristic profile (CP), which summarizes the local structural pattern in the hypergraph. Specifically, the characteristic profile of a hypergraph $G$ is a vector of size $26$, where each $t$-th element is

$$CP_t := \frac{\Delta_t}{\sqrt{\sum_{t'=1}^{26} \Delta_{t'}^2}}. \tag{2}$$

Note that, for each $t$, $CP_t$ is between $-1$ and $1$. The CP is used in Section 5.3 to compare the local structural patterns of real-world hypergraphs from diverse domains.
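A minimal sketch of the two steps above, assuming the significance is the difference of the real and randomized counts divided by their sum plus a constant (taken to be 1 here), and the CP is that vector scaled to unit Euclidean length. The function names and the list-based inputs are hypothetical.

```python
import math


def significance(real_counts, rand_counts, eps=1.0):
    """Per-motif significance; rand_counts holds averages over randomized hypergraphs."""
    return [(m - m_rand) / (m + m_rand + eps)
            for m, m_rand in zip(real_counts, rand_counts)]


def characteristic_profile(real_counts, rand_counts, eps=1.0):
    """Significance vector normalized to unit Euclidean length."""
    delta = significance(real_counts, rand_counts, eps)
    norm = math.sqrt(sum(d * d for d in delta))
    if norm == 0:
        return [0.0] * len(delta)  # all motifs are exactly as frequent as at random
    return [d / norm for d in delta]
```

Since the profile has unit length whenever it is non-zero, every entry necessarily lies between -1 and 1, and two profiles can be compared with a plain dot product or correlation.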

4 Proposed Algorithms

Given a hypergraph, how can we count the instances of each h-motif? Once we count them in the original and randomized hypergraphs, the significance of each motif and the CP are obtained immediately by Eq. (1) and Eq. (2).

The problem of counting h-motifs’ instances bears some similarity to the classic problem of counting network motifs’ instances. However, different from network motifs, which are defined solely based on pairwise interactions, h-motifs are defined based on non-pairwise interactions (e.g., $e_i \cap e_j \cap e_k$). Due to this difference, new approaches are required.

In this section, we present MoCHy (Motif Counting in Hypergraphs), which is a family of parallel algorithms for counting the instances of each h-motif in the input hypergraph. We first describe hypergraph projection, which is a preprocessing step of every version of MoCHy. Then, we present MoCHy-E, which is for exact counting. After that, we present two different versions of MoCHy-A, which are sampling-based algorithms for approximate counting. Lastly, we discuss parallel and on-the-fly implementations.

Throughout this section, we use $h(\{e_i, e_j, e_k\})$ to denote the h-motif that describes the connectivity pattern of an h-motif instance $\{e_i, e_j, e_k\}$. We also use $M[t]$ to denote the count of instances of h-motif $t$.

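Algorithm 1 Hypergraph Projection (Preprocess)
Input : input hypergraph: $G = (V, E)$
Output : projected graph: $\bar{G} = (E, \wedge, \omega)$
1 $\wedge \leftarrow \emptyset$; $\omega \leftarrow$ map whose default value is $0$
2 for each hyperedge $e_i \in E$ do
3        for each node $v \in e_i$ do
4               for each hyperedge $e_j \in E_v$ where $j > i$ do
5                      $\wedge \leftarrow \wedge \cup \{\wedge_{ij}\}$
6                      $\omega(\wedge_{ij}) \leftarrow \omega(\wedge_{ij}) + 1$
7 return $\bar{G} = (E, \wedge, \omega)$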

4.1 Hypergraph Projection (Algorithm 1)

As a preprocessing step, every version of MoCHy builds the projected graph $\bar{G} = (E, \wedge, \omega)$ (see Section 3.1) of the input hypergraph $G = (V, E)$, as described in Algorithm 1. To find the neighbors of each hyperedge $e_i$ (line 2), the algorithm visits each hyperedge $e_j$ that contains $v$ and satisfies $j > i$ (line 4) for each node $v \in e_i$ (line 3). Then for each such $e_j$, it adds $\wedge_{ij}$ to $\wedge$ and increments $\omega(\wedge_{ij})$ (lines 5 and 6). The time complexity of this preprocessing step is given in Lemma 1.

Lemma 1 (Complexity of Hypergraph Projection).

The time complexity of Algorithm 1 is $O(\sum_{\wedge_{ij} \in \wedge} |e_i \cap e_j|)$.

Proof.

If all sets and maps are implemented using hash tables, lines 5 and 6 take $O(1)$ time, and they are executed $|e_i \cap e_j|$ times for each $\wedge_{ij} \in \wedge$. ∎

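Since $|\wedge| < |E|^2$ and $|e_i \cap e_j| \le |e_i|$, Eq. (3) holds.

$$\sum_{\wedge_{ij} \in \wedge} |e_i \cap e_j| \le \min\Big(\sum_{e_i \in E} (|e_i| \cdot |N_{e_i}|),\; |E|^2 \cdot \max_{e_i \in E} |e_i|\Big). \tag{3}$$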
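The projection step can be sketched in Python as below. This is an illustrative sketch rather than the authors' implementation: hyperedges are assumed to be given as a list of node sets, hyperwedges are represented as index pairs (i, j) with i < j, and the function name `project` is hypothetical.

```python
from collections import defaultdict


def project(hyperedges):
    """Return (wedges, omega): hyperwedges as index pairs and their overlap sizes."""
    # E_v: indices of the hyperedges that contain node v.
    edges_of = defaultdict(list)
    for i, e in enumerate(hyperedges):
        for v in e:
            edges_of[v].append(i)

    omega = defaultdict(int)
    for i, e_i in enumerate(hyperedges):
        for v in e_i:
            for j in edges_of[v]:
                if j > i:                # visit each adjacent pair from its smaller index only
                    omega[(i, j)] += 1   # one increment per shared node
    return set(omega), dict(omega)
```

The innermost statement runs once per shared node of each adjacent pair, which is where the sum of overlap sizes in the complexity bound comes from. For simplicity this sketch scans every hyperedge that contains `v` and filters by index, so it does slightly more work than that bound.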

4.2 Exact H-motif Counting (Algorithm 2)

We present MoCHy-E (MoCHy Exact), which exactly counts the instances of each h-motif. The procedures of MoCHy-E are described in Algorithm 2. For each hyperedge $e_i \in E$ (line 2), each unordered pair $\{e_j, e_k\}$ of its neighbors, which forms an h-motif instance $\{e_i, e_j, e_k\}$, is considered (line 3). If $e_j \cap e_k = \emptyset$ (i.e., if the corresponding h-motif is open), $\{e_i, e_j, e_k\}$ is considered only once. However, if $e_j \cap e_k \neq \emptyset$ (i.e., if the corresponding h-motif is closed), $\{e_i, e_j, e_k\}$ is considered two more times (i.e., when $e_j$ is chosen in line 2 and when $e_k$ is chosen in line 2). Based on these observations, given an h-motif instance $\{e_i, e_j, e_k\}$, the corresponding count $M[h(\{e_i, e_j, e_k\})]$ is incremented (line 5) only if $e_j \cap e_k = \emptyset$ or $i < \min(j, k)$ (line 4). This guarantees that each instance is counted exactly once. The time complexity of MoCHy-E is given in Theorem 1, which is based on Lemma 2.

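Lemma 2 (Time Complexity of Computing $h(\{e_i, e_j, e_k\})$).

Given the input hypergraph $G = (V, E)$ and its projected graph $\bar{G} = (E, \wedge, \omega)$, for each h-motif instance $\{e_i, e_j, e_k\}$, computing $h(\{e_i, e_j, e_k\})$ takes $O(\min(|e_i|, |e_j|, |e_k|))$ time.

Proof.

Assume $|e_i| = \min(|e_i|, |e_j|, |e_k|)$, without loss of generality, and all sets and maps are implemented using hash tables. As defined in Section 3.2, $h(\{e_i, e_j, e_k\})$ is computed in $O(1)$ time from the emptiness of the following sets: (1) $e_i \setminus e_j \setminus e_k$, (2) $e_j \setminus e_k \setminus e_i$, (3) $e_k \setminus e_i \setminus e_j$, (4) $e_i \cap e_j \setminus e_k$, (5) $e_j \cap e_k \setminus e_i$, (6) $e_k \cap e_i \setminus e_j$, and (7) $e_i \cap e_j \cap e_k$. We check their emptiness from their cardinalities. We obtain $e_i$, $e_j$, and $e_k$, which are stored in $G$, and their cardinalities in $O(1)$ time. Similarly, we obtain $|e_i \cap e_j|$, $|e_j \cap e_k|$, and $|e_k \cap e_i|$, which are stored in $\bar{G}$, in $O(1)$ time. Then, we compute $|e_i \cap e_j \cap e_k|$ in $O(|e_i|)$ time by checking for each node in $e_i$ whether it is also in both $e_j$ and $e_k$. From these cardinalities, we obtain the cardinalities of the six other sets in $O(1)$ time as follows:

$$\begin{aligned}
(1)\ |e_i \setminus e_j \setminus e_k| &= |e_i| - |e_i \cap e_j| - |e_k \cap e_i| + |e_i \cap e_j \cap e_k|, \\
(2)\ |e_j \setminus e_k \setminus e_i| &= |e_j| - |e_i \cap e_j| - |e_j \cap e_k| + |e_i \cap e_j \cap e_k|, \\
(3)\ |e_k \setminus e_i \setminus e_j| &= |e_k| - |e_k \cap e_i| - |e_j \cap e_k| + |e_i \cap e_j \cap e_k|, \\
(4)\ |e_i \cap e_j \setminus e_k| &= |e_i \cap e_j| - |e_i \cap e_j \cap e_k|, \\
(5)\ |e_j \cap e_k \setminus e_i| &= |e_j \cap e_k| - |e_i \cap e_j \cap e_k|, \\
(6)\ |e_k \cap e_i \setminus e_j| &= |e_k \cap e_i| - |e_i \cap e_j \cap e_k|.
\end{aligned}$$

Hence, the time complexity of computing $h(\{e_i, e_j, e_k\})$ is $O(|e_i|) = O(\min(|e_i|, |e_j|, |e_k|))$. ∎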
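The counting argument in the proof can be sketched as follows, assuming the pairwise overlap sizes have already been precomputed (for example by the projection step) and are passed in as integers. The function name `region_sizes` and its signature are hypothetical.

```python
def region_sizes(e_i, e_j, e_k, w_ij, w_jk, w_ki):
    """Sizes of the seven Venn regions of three hyperedges.

    w_ij, w_jk, w_ki are the precomputed pairwise overlap sizes.
    Only the triple overlap requires a scan, over the smallest hyperedge.
    """
    smallest, other_a, other_b = sorted((e_i, e_j, e_k), key=len)
    triple = sum(1 for v in smallest if v in other_a and v in other_b)
    return (
        len(e_i) - w_ij - w_ki + triple,   # only in e_i
        len(e_j) - w_ij - w_jk + triple,   # only in e_j
        len(e_k) - w_ki - w_jk + triple,   # only in e_k
        w_ij - triple,                     # in e_i and e_j, not e_k
        w_jk - triple,                     # in e_j and e_k, not e_i
        w_ki - triple,                     # in e_k and e_i, not e_j
        triple,                            # in all three
    )
```

For example, with e_i = {1, 2, 3, 4}, e_j = {3, 4, 5}, and e_k = {4, 5, 6, 7}, the overlaps are 2, 2, and 1, the triple overlap is 1, and the function returns (2, 0, 2, 1, 1, 0, 1). The h-motif then follows from which of these sizes are zero.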

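Algorithm 2 MoCHy-E: Exact H-motif Counting
Input :    (1) input hypergraph: $G = (V, E)$
(2) projected graph: $\bar{G} = (E, \wedge, \omega)$
Output : exact count of each h-motif $t$'s instances: $M[t]$
1 $M \leftarrow$ map whose default value is $0$
2 for each hyperedge $e_i \in E$ do
3        for each unordered hyperedge pair $\{e_j, e_k\} \in \binom{N_{e_i}}{2}$ do
4               if $e_j \cap e_k = \emptyset$ or $i < \min(j, k)$ then
5                      $M[h(\{e_i, e_j, e_k\})]$ += $1$
6 return $M$

Theorem 1 (Complexity of MoCHy-E).

The time complexity of Algorithm 2 is $O(\sum_{e_i \in E} (|N_{e_i}|^2 \cdot |e_i|))$.

Proof.

Assume all sets and maps are implemented using hash tables. The total number of triples $\{e_i, e_j, e_k\}$ considered in line 3 is $O(\sum_{e_i \in E} |N_{e_i}|^2)$. By Lemma 2, for such a triple $\{e_i, e_j, e_k\}$, computing $h(\{e_i, e_j, e_k\})$ takes $O(|e_i|)$ time. Thus, the total time complexity of Algorithm 2 is $O(\sum_{e_i \in E} (|N_{e_i}|^2 \cdot |e_i|))$, which upper bounds that of the preprocessing step (see Lemma 1 and Eq. (3)). ∎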
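The exact counting procedure can be sketched as below. This is a sketch under stated assumptions, not the authors' C++ implementation: hyperedges are a list of distinct node sets, h-motifs are identified by a canonical seven-bit tuple instead of the numbering in Figure 4, and the pattern is computed directly from set operations rather than from the stored overlap sizes of Lemma 2. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations, permutations


def pattern(e_i, e_j, e_k):
    """Emptiness (0) or non-emptiness (1) of the seven Venn regions."""
    regions = (e_i - e_j - e_k, e_j - e_k - e_i, e_k - e_i - e_j,
               (e_i & e_j) - e_k, (e_j & e_k) - e_i, (e_k & e_i) - e_j,
               e_i & e_j & e_k)
    return tuple(int(bool(r)) for r in regions)


def motif_key(e_i, e_j, e_k):
    """Canonical pattern, independent of the order of the three hyperedges."""
    return min(pattern(*order) for order in permutations((e_i, e_j, e_k)))


def neighbors_of(hyperedges):
    """For each hyperedge index i, the indices of the hyperedges adjacent to it."""
    edges_of = defaultdict(set)
    for i, e in enumerate(hyperedges):
        for v in e:
            edges_of[v].add(i)
    return [set().union(*(edges_of[v] for v in e)) - {i}
            for i, e in enumerate(hyperedges)]


def mochy_e(hyperedges):
    """Exact count of instances per canonical h-motif key."""
    hyperedges = [frozenset(e) for e in hyperedges]
    nbrs = neighbors_of(hyperedges)
    counts = Counter()
    for i, e_i in enumerate(hyperedges):
        for j, k in combinations(sorted(nbrs[i]), 2):
            e_j, e_k = hyperedges[j], hyperedges[k]
            # Open instances are seen only from their centre hyperedge; closed
            # instances are seen three times and counted from the smallest index.
            if not (e_j & e_k) or i < min(j, k):
                counts[motif_key(e_i, e_j, e_k)] += 1
    return counts
```

On small inputs the result can be checked against a brute-force scan over all triples of hyperedges, which is a convenient sanity test for the deduplication condition.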

4.3 Approximate H-motif Counting

We present two different versions of MoCHy-A (MoCHy Approximate), which approximately count the instances of each h-motif. Both versions estimate the counts by exploring the input hypergraph partially through hyperedge and hyperwedge sampling, resp., and thus they are particularly useful for large-scale hypergraphs. In addition, both versions yield unbiased estimates.

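Algorithm 3 MoCHy-A: Approximate H-motif Counting Based on Hyperedge Sampling
Input :    (1) input hypergraph: $G = (V, E)$
(2) projected graph: $\bar{G} = (E, \wedge, \omega)$
(3) number of samples: $s$
Output : estimated count of each h-motif $t$'s instances: $\bar{M}[t]$
1 $\bar{M} \leftarrow$ map whose default value is $0$
2 for $n \leftarrow 1 \dots s$ do
3        $e_i \leftarrow$ sample a uniformly random hyperedge
4        for each hyperedge $e_j \in N_{e_i}$ do
5               for each hyperedge $e_k \in (N_{e_i} \cup N_{e_j} \setminus \{e_i, e_j\})$ do
6                      if $e_k \notin N_{e_i}$ or $j < k$ then
7                             $\bar{M}[h(\{e_i, e_j, e_k\})]$ += $1$
8 for each h-motif $t$ do
9        $\bar{M}[t] \leftarrow \bar{M}[t] \cdot \frac{|E|}{3s}$
10 return $\bar{M}$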

MoCHy-A: Hyperedge Sampling (Algorithm 3):

MoCHy-A, which is based on hyperedge sampling, is described in Algorithm 3. It repeatedly samples $s$ hyperedges from the hyperedge set $E$ uniformly at random with replacement (line 3). For each sampled hyperedge $e_i$, the algorithm searches for all h-motif instances that contain $e_i$ (lines 4-7), and to this end, the $1$-hop and $2$-hop neighbors of $e_i$ in the projected graph $\bar{G}$ are explored. After that, for each such instance $\{e_i, e_j, e_k\}$ of h-motif $t$, the corresponding count $\bar{M}[t]$ is incremented (line 7). Lastly, each estimate $\bar{M}[t]$ is rescaled by multiplying it with $\frac{|E|}{3s}$ (lines 8-9), which is the reciprocal of the expected number of times that each of the h-motif $t$'s instances is counted. Note that each hyperedge is expected to be sampled $\frac{s}{|E|}$ times, and each h-motif instance is counted whenever any of its three hyperedges is sampled. This rescaling makes each estimate $\bar{M}[t]$ unbiased, as formalized in Theorem 2.

Theorem 2 (Bias and Variance of MoCHy-A).

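For every h-motif $t$, Algorithm 3 provides an unbiased estimate $\bar{M}[t]$ of the count $M[t]$ of its instances, i.e.,

$$\mathbb{E}[\bar{M}[t]] = M[t]. \tag{4}$$

The variance of the estimate is

$$\mathrm{Var}[\bar{M}[t]] = \frac{1}{3s} \cdot M[t] \cdot (|E| - 3) + \frac{1}{9s} \sum_{l=0}^{2} p_l[t] \cdot (l|E| - 9), \tag{5}$$

where $p_l[t]$ is the number of pairs of h-motif $t$'s instances that share $l$ hyperedges.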

Proof.

See Appendix A. ∎

The time complexity of MoCHy-A is given in Theorem 3.

Theorem 3 (Complexity of MoCHy-A).

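The average time complexity of Algorithm 3 is $O(\frac{s}{|E|} \sum_{e_i \in E} (|e_i| \cdot |N_{e_i}|^2))$.

Proof.

Assume all sets and maps are implemented using hash tables. For a sample hyperedge $e_i$, computing $N_{e_i} \cup N_{e_j}$ for every $e_j \in N_{e_i}$ takes $O(\sum_{e_j \in N_{e_i}} |N_{e_i} \cup N_{e_j}|)$ time, and by Lemma 2, computing $h(\{e_i, e_j, e_k\})$ for all considered h-motif instances takes $O(\sum_{e_j \in N_{e_i}} (\min(|e_i|, |e_j|) \cdot |N_{e_i} \cup N_{e_j}|))$ time. Thus, from $|N_{e_i} \cup N_{e_j}| \le |N_{e_i}| + |N_{e_j}|$, the time complexity for processing a sample $e_i$ is

$$O\Big(\sum_{e_j \in N_{e_i}} \min(|e_i|, |e_j|) \cdot (|N_{e_i}| + |N_{e_j}|)\Big),$$

which can be written as

$$O\Big(\sum_{e_i \in E} 1(e_i \text{ is sampled}) \cdot |e_i| \cdot |N_{e_i}|^2 + \sum_{e_j \in E} 1(e_j \text{ is adjacent to the sample}) \cdot |e_j| \cdot |N_{e_j}|\Big).$$

From this, linearity of expectation, $\mathbb{E}[1(e_i \text{ is sampled})] = \frac{1}{|E|}$, and $\mathbb{E}[1(e_j \text{ is adjacent to the sample})] = \frac{|N_{e_j}|}{|E|}$, the average time complexity per sample hyperedge becomes $O(\frac{1}{|E|} \sum_{e_i \in E} (|e_i| \cdot |N_{e_i}|^2))$. Hence, the total time complexity for processing $s$ samples is $O(\frac{s}{|E|} \sum_{e_i \in E} (|e_i| \cdot |N_{e_i}|^2))$. ∎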
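Hyperedge sampling can be sketched as follows. The sketch assumes that each instance containing the sampled hyperedge must be counted exactly once per sample (hence the deduplication test inside the loop), identifies h-motifs by a canonical seven-bit tuple rather than the numbering in Figure 4, and splits the work into a deterministic part (`estimate_from_samples`, which accepts any list of sampled indices) and a random driver (`mochy_a`). All names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import permutations


def pattern(e_i, e_j, e_k):
    regions = (e_i - e_j - e_k, e_j - e_k - e_i, e_k - e_i - e_j,
               (e_i & e_j) - e_k, (e_j & e_k) - e_i, (e_k & e_i) - e_j,
               e_i & e_j & e_k)
    return tuple(int(bool(r)) for r in regions)


def motif_key(e_i, e_j, e_k):
    return min(pattern(*order) for order in permutations((e_i, e_j, e_k)))


def neighbors_of(hyperedges):
    edges_of = defaultdict(set)
    for i, e in enumerate(hyperedges):
        for v in e:
            edges_of[v].add(i)
    return [set().union(*(edges_of[v] for v in e)) - {i}
            for i, e in enumerate(hyperedges)]


def estimate_from_samples(hyperedges, sampled):
    """Rescaled counts given the indices of the sampled hyperedges."""
    hyperedges = [frozenset(e) for e in hyperedges]
    nbrs = neighbors_of(hyperedges)
    sampled = list(sampled)
    raw = Counter()
    for i in sampled:
        for j in nbrs[i]:
            for k in (nbrs[i] | nbrs[j]) - {i, j}:
                # If both j and k are neighbours of i, the triple is reached
                # twice (as (j, k) and (k, j)); keep only the ordered visit.
                if k not in nbrs[i] or j < k:
                    raw[motif_key(hyperedges[i], hyperedges[j], hyperedges[k])] += 1
    scale = len(hyperedges) / (3 * len(sampled))
    return {t: c * scale for t, c in raw.items()}


def mochy_a(hyperedges, s, seed=None):
    """Estimate from s hyperedges drawn uniformly at random with replacement."""
    rng = random.Random(seed)
    return estimate_from_samples(
        hyperedges, [rng.randrange(len(hyperedges)) for _ in range(s)])
```

A useful sanity check is that passing every hyperedge index exactly once to `estimate_from_samples` reproduces the exact counts, since each instance is then counted three times and rescaled by one third.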

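Algorithm 4 MoCHy-A+: Approximate H-motif Counting Based on Hyperwedge Sampling
Input :    (1) input hypergraph: $G = (V, E)$
(2) projected graph: $\bar{G} = (E, \wedge, \omega)$
(3) number of samples: $r$
Output : estimated count of each h-motif $t$'s instances: $\hat{M}[t]$
1 $\hat{M} \leftarrow$ map whose default value is $0$
2 for $n \leftarrow 1 \dots r$ do
3        $\wedge_{ij} \leftarrow$ a uniformly random hyperwedge
4        for each hyperedge $e_k \in (N_{e_i} \cup N_{e_j} \setminus \{e_i, e_j\})$ do
5               $\hat{M}[h(\{e_i, e_j, e_k\})]$ += $1$
6 for each h-motif $t$ do
7        if $17 \le t \le 22$ then  ▷ open h-motifs
8               $\hat{M}[t] \leftarrow \hat{M}[t] \cdot \frac{|\wedge|}{2r}$
9        else  ▷ closed h-motifs
10               $\hat{M}[t] \leftarrow \hat{M}[t] \cdot \frac{|\wedge|}{3r}$
11 return $\hat{M}$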

MoCHy-A+: Hyperwedge Sampling (Algorithm 4):

MoCHy-A+, which provides a better trade-off between speed and accuracy than MoCHy-A, is described in Algorithm 4. Different from MoCHy-A, which samples hyperedges, MoCHy-A+ is based on hyperwedge sampling. It selects $r$ hyperwedges uniformly at random with replacement (line 3), and for each sampled hyperwedge $\wedge_{ij}$, it searches for all h-motif instances that contain $\wedge_{ij}$ (lines 4-5). To this end, the hyperedges that are adjacent to $e_i$ or $e_j$ in the projected graph $\bar{G}$ are considered (line 4). For each such instance $\{e_i, e_j, e_k\}$ of h-motif $t$, the corresponding estimate $\hat{M}[t]$ is incremented (line 5). Lastly, each estimate $\hat{M}[t]$ is rescaled so that it unbiasedly estimates $M[t]$, as formalized in Theorem 4. To this end, each estimate is multiplied by the reciprocal of the expected number of times that each instance of h-motif $t$ is counted. Note that each instance of open and closed h-motifs contains $2$ and $3$ hyperwedges, respectively. Each instance of closed h-motifs is counted if one of the $3$ hyperwedges in it is sampled, while that of open h-motifs is counted if one of the $2$ hyperwedges in it is sampled. Thus, on average, each instance of open and closed h-motifs is counted $\frac{2r}{|\wedge|}$ and $\frac{3r}{|\wedge|}$ times, respectively.

Theorem 4 (Bias and Variance of MoCHy-A+).

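For every h-motif $t$, Algorithm 4 provides an unbiased estimate $\hat{M}[t]$ of the count $M[t]$ of its instances, i.e.,

$$\mathbb{E}[\hat{M}[t]] = M[t]. \tag{6}$$

For every closed h-motif $t$, the variance of the estimate is

$$\mathrm{Var}[\hat{M}[t]] = \frac{1}{3r} \cdot M[t] \cdot (|\wedge| - 3) + \frac{1}{9r} \sum_{n=0}^{1} q_n[t] \cdot (n|\wedge| - 9), \tag{7}$$

where $q_n[t]$ is the number of pairs of h-motif $t$'s instances that share $n$ hyperwedges. For every open h-motif $t$, the variance is

$$\mathrm{Var}[\hat{M}[t]] = \frac{1}{2r} \cdot M[t] \cdot (|\wedge| - 2) + \frac{1}{4r} \sum_{n=0}^{1} q_n[t] \cdot (n|\wedge| - 4). \tag{8}$$

Proof.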

See Appendix B. ∎

The time complexity of MoCHy-A+ is given in Theorem 5.

Theorem 5 (Complexity of MoCHy-A+).

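The average time complexity of Algorithm 4 is $O(\frac{r}{|\wedge|} \sum_{e_i \in E} (|e_i| \cdot |N_{e_i}|^2))$.

Proof.

Assume all sets and maps are implemented using hash tables. For a sample hyperwedge $\wedge_{ij}$, computing $N_{e_i} \cup N_{e_j}$ takes $O(|N_{e_i} \cup N_{e_j}|)$ time, and by Lemma 2, computing $h(\{e_i, e_j, e_k\})$ for all considered h-motif instances takes $O(\min(|e_i|, |e_j|) \cdot |N_{e_i} \cup N_{e_j}|)$ time. Thus, from $|N_{e_i} \cup N_{e_j}| \le |N_{e_i}| + |N_{e_j}|$, the time complexity for processing a sample $\wedge_{ij}$ is $O(\min(|e_i|, |e_j|) \cdot (|N_{e_i}| + |N_{e_j}|))$, which can be written as

$$O\Big(\sum_{e_i \in E} 1(e_i \text{ is included in the sample}) \cdot |e_i| \cdot |N_{e_i}| + \sum_{e_j \in E} 1(e_j \text{ is included in the sample}) \cdot |e_j| \cdot |N_{e_j}|\Big).$$

From this, linearity of expectation, $\mathbb{E}[1(e_i \text{ is included in the sample})] = \frac{|N_{e_i}|}{|\wedge|}$, and $\mathbb{E}[1(e_j \text{ is included in the sample})] = \frac{|N_{e_j}|}{|\wedge|}$, the average time complexity per sample hyperwedge is $O(\frac{1}{|\wedge|} \sum_{e_i \in E} (|e_i| \cdot |N_{e_i}|^2))$. Hence, the total time complexity for processing $r$ samples is $O(\frac{r}{|\wedge|} \sum_{e_i \in E} (|e_i| \cdot |N_{e_i}|^2))$. ∎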
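Hyperwedge sampling can be sketched in the same style. As before, this is an illustrative sketch and not the authors' implementation: h-motifs are identified by a canonical seven-bit tuple, open motifs are recognized from that tuple rather than from the ids 17 to 22, and the deterministic core (`estimate_from_hyperwedges`) is separated from the random driver (`mochy_a_plus`). All names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import permutations


def pattern(e_i, e_j, e_k):
    regions = (e_i - e_j - e_k, e_j - e_k - e_i, e_k - e_i - e_j,
               (e_i & e_j) - e_k, (e_j & e_k) - e_i, (e_k & e_i) - e_j,
               e_i & e_j & e_k)
    return tuple(int(bool(r)) for r in regions)


def motif_key(e_i, e_j, e_k):
    return min(pattern(*order) for order in permutations((e_i, e_j, e_k)))


def is_open(key):
    """Open if some pair of hyperedges shares no node."""
    return key[6] == 0 and 0 in key[3:6]


def neighbors_of(hyperedges):
    edges_of = defaultdict(set)
    for i, e in enumerate(hyperedges):
        for v in e:
            edges_of[v].add(i)
    return [set().union(*(edges_of[v] for v in e)) - {i}
            for i, e in enumerate(hyperedges)]


def all_hyperwedges(nbrs):
    """Every adjacent pair of hyperedge indices (i, j) with i < j."""
    return sorted((i, j) for i, ns in enumerate(nbrs) for j in ns if i < j)


def estimate_from_hyperwedges(hyperedges, sampled):
    """Rescaled counts given the sampled hyperwedges as index pairs."""
    hyperedges = [frozenset(e) for e in hyperedges]
    nbrs = neighbors_of(hyperedges)
    num_wedges = len(all_hyperwedges(nbrs))
    sampled = list(sampled)
    raw = Counter()
    for i, j in sampled:
        for k in (nbrs[i] | nbrs[j]) - {i, j}:
            raw[motif_key(hyperedges[i], hyperedges[j], hyperedges[k])] += 1
    estimates = {}
    for t, c in raw.items():
        wedges_per_instance = 2 if is_open(t) else 3
        estimates[t] = c * num_wedges / (wedges_per_instance * len(sampled))
    return estimates


def mochy_a_plus(hyperedges, r, seed=None):
    """Estimate from r hyperwedges drawn uniformly at random with replacement."""
    rng = random.Random(seed)
    wedges = all_hyperwedges(neighbors_of([frozenset(e) for e in hyperedges]))
    return estimate_from_hyperwedges(
        hyperedges, [rng.choice(wedges) for _ in range(r)])
```

Passing every hyperwedge exactly once to `estimate_from_hyperwedges` reproduces the exact counts, because open instances are then reached twice and closed instances three times before rescaling.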

Comparison of MoCHy-A and MoCHy-A+: Empirically, MoCHy-A+ provides a better trade-off between speed and accuracy than MoCHy-A, as presented in Section 5.4. Below, we provide an analysis that supports this observation.

Assume that the numbers of samples in both algorithms are set so that $\alpha = \frac{s}{|E|} = \frac{r}{|\wedge|}$. For each h-motif $t$, since both estimates $\bar{M}[t]$ of MoCHy-A and $\hat{M}[t]$ of MoCHy-A+ are unbiased (see Eq. (4) and (6)), we only need to compare their variances. By Eq. (5), $\mathrm{Var}[\bar{M}[t]] = O(\frac{M[t] + p_1[t] + p_2[t]}{\alpha})$, and by Eq. (7) and Eq. (8), $\mathrm{Var}[\hat{M}[t]] = O(\frac{M[t] + q_1[t]}{\alpha})$. By definition, $q_1[t] \le p_2[t]$, and thus $q_1[t] \le p_1[t] + p_2[t]$. Moreover, in real-world hypergraphs, $p_1[t]$ tends to be several orders of magnitude larger than the other terms (i.e., $p_2[t]$, $q_1[t]$, and $M[t]$), and thus $\bar{M}[t]$ of MoCHy-A tends to have much larger variance (and thus much larger estimation error) than $\hat{M}[t]$ of MoCHy-A+. Despite this fact, MoCHy-A and MoCHy-A+ have the same time complexity, which is $O(\alpha \cdot \sum_{e_i \in E} (|e_i| \cdot |N_{e_i}|^2))$ (see Theorems 3 and 5). Hence, MoCHy-A+ is expected to provide a better trade-off between speed and accuracy than MoCHy-A, as confirmed empirically in Section 5.4.

4.4 Parallel and On-the-fly Implementations

We discuss parallelization of MoCHy and then on-the-fly computation of projected graphs.

Parallelization: All versions of MoCHy and hypergraph projection are easily parallelized. Specifically, we can parallelize hypergraph projection and MoCHy-E by letting multiple threads process different hyperedges (in line 2 of Algorithm 1 and line 2 of Algorithm 2, respectively) independently in parallel. Similarly, we can parallelize MoCHy-A and MoCHy-A+ by letting multiple threads sample and process different hyperedges (in line 3 of Algorithm 3) and hyperwedges (in line 3 of Algorithm 4), respectively, independently in parallel. The estimated counts of the same h-motif obtained by different threads are summed up only once before they are returned as outputs. We present some empirical results in Section 5.4.

H-motif Counting without Projected Graphs: If the input hypergraph $G$ is large, computing its projected graph $\bar{G}$ (Algorithm 1) is time and space consuming. Specifically, building $\bar{G}$ takes $O(\sum_{\wedge_{ij} \in \wedge} |e_i \cap e_j|)$ time (see Lemma 1) and requires $O(|E| + |\wedge|)$ space, which often exceeds the $O(\sum_{e_i \in E} |e_i|)$ space required for storing $G$. Thus, instead of precomputing $\bar{G}$ entirely, we can build it incrementally while memoizing partial results within a given memory budget. For example, in MoCHy-A+ (Algorithm 4), we compute the neighborhood of a hyperedge $e_i$ in $\bar{G}$ (i.e., $N_{e_i}$ together with the weights $\omega(\wedge_{ik})$ for each $e_k \in N_{e_i}$) only if (1) a hyperwedge with $e_i$ (e.g., $\wedge_{ij}$) is sampled (in line 3) and (2) its neighborhood is not memoized.

This incremental computation of $\bar{G}$ can be beneficial in terms of speed since it skips projecting the neighborhood of a hyperedge if no hyperwedge containing it is sampled. However, it can also be harmful if memoized results exceed the memory budget and some parts of $\bar{G}$ need to be rebuilt multiple times. Then, given a memory budget in bits, how should we prioritize hyperedges if all their neighborhoods cannot be memoized? According to our experiments, despite their large size, memoizing the neighborhoods of hyperedges with high degree in $\bar{G}$ makes MoCHy-A+ faster than memoizing the neighborhoods of randomly chosen hyperedges or least recently used (LRU) hyperedges. In Section 5.4, we experimentally examine the effects of the memory budget on the speed of MoCHy-A+.
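The on-the-fly scheme can be sketched as a small cache around the neighborhood computation. The sketch makes several simplifying assumptions that are not from the paper: the budget is counted in memoized neighbor entries rather than bits, the degree-based priority is approximated by evicting the memoized neighborhood with the fewest entries first, and the class name `LazyProjection` and its attributes are hypothetical.

```python
from collections import Counter, defaultdict


class LazyProjection:
    """Computes neighborhoods in the projected graph on demand, with a bounded cache."""

    def __init__(self, hyperedges, budget):
        self.hyperedges = [frozenset(e) for e in hyperedges]
        self.edges_of = defaultdict(list)   # node -> indices of hyperedges containing it
        for i, e in enumerate(self.hyperedges):
            for v in e:
                self.edges_of[v].append(i)
        self.budget = budget                # max total number of memoized neighbor entries
        self.cache = {}                     # hyperedge index -> {neighbor index: overlap size}
        self.used = 0
        self.misses = 0

    def neighborhood(self, i):
        """Return {j: |e_i ∩ e_j|} for every hyperedge j adjacent to hyperedge i."""
        if i in self.cache:
            return self.cache[i]
        self.misses += 1
        nb = dict(Counter(j for v in self.hyperedges[i]
                          for j in self.edges_of[v] if j != i))
        self._memoize(i, nb)
        return nb

    def _memoize(self, i, nb):
        self.cache[i] = nb
        self.used += len(nb)
        # Over budget: drop low-degree neighborhoods first, keep high-degree ones.
        while self.used > self.budget and self.cache:
            victim = min(self.cache, key=lambda x: len(self.cache[x]))
            self.used -= len(self.cache.pop(victim))
```

A sampler would call `neighborhood(i)` and `neighborhood(j)` for each sampled hyperwedge instead of reading a precomputed projected graph; `misses` then counts how many neighborhoods had to be (re)built.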

5 Experiments

Table 2: Real-world and random hypergraphs have distinct distributions of h-motif instances. We report the absolute number of each h-motif’s instances in a real-world hypergraph from each domain and its corresponding random hypergraph. The counts are ranked, and the ranks in the real-world and corresponding random hypergraphs are compared.
Dataset          Nodes      Hyperedges  Hyperwedges  # h-motifs
coauth-DBLP      1,924,991  2,466,792   125M         26.3B ± 18M
coauth-geology   1,256,385  1,203,895   37.6M        6B ± 4.8M
coauth-history   1,014,734  895,439     1.7M         83.2M
contact-primary  242        12,704      2.2M         617M
contact-high     327        7,818       593K         69.7M
email-Enron      143        1,512       87.8K        9.6M
email-EU         998        25,027      8.3M         7B
tags-ubuntu      3,029      147,222     564M         4.3T ± 1.5B
tags-math        1,629      170,476     913M         9.2T ± 3.2B
threads-ubuntu   125,602    166,999     21.6M        11.4B
threads-math     176,445    595,749     647M         2.2T ± 883M
Table 3: Statistics of 11 real hypergraphs from 5 domains.

In this section, we review the experiments that we designed to answer the following questions:

  • Q1. Comparison with Random: Does counting instances of different h-motifs reveal structural design principles of real-world hypergraphs distinguished from those of random hypergraphs?

  • Q2. Comparison across Domains: Do characteristic profiles capture local structural patterns of hypergraphs unique to each domain?

  • Q3. Performance of Counting Algorithms: How fast and accurate are the different versions of MoCHy? Does the advanced version outperform the basic ones?

5.1 Experimental Settings

Machines: We conducted all the experiments on a machine with an AMD Ryzen 9 3900X CPU and 128GB RAM.

Implementations: We implemented every version of MoCHy using C++ and OpenMP.

Datasets: We used the following 11 real-world hypergraphs from five different domains:

  • co-authorship (coauth-DBLP, coauth-geology [53], and coauth-history [53]): A node represents an author. A hyperedge represents the authors of a publication.

  • contact (contact-primary [54] and contact-high [40]): A node represents a person. A hyperedge represents a group interaction among individuals.

  • email (email-Enron [32] and email-EU [35, 62]): A node represents an e-mail account. A hyperedge consists of the sender and all receivers of an email.

  • tags (tags-ubuntu and tags-math): A node represents a tag. A hyperedge represents the tags attached to a post.

  • threads (threads-ubuntu and threads-math): A node represents a user. A hyperedge groups all users participating in a thread.

These hypergraphs are made public by the authors of [10] (https://www.cs.cornell.edu/~arb/data/), and in Table 3 we provide some statistics of the hypergraphs after removing duplicated hyperedges. We used MoCHy-E for the coauth-history dataset, the threads-ubuntu dataset, and all datasets from the contact and email domains. For the other datasets, we used MoCHy-A+ with , unless otherwise stated. For the tags-ubuntu, tags-math, and threads-math datasets with trillions of h-motifs, we set the memory budget to of the edges in the projected graphs, as described in Section 4.4, while we precomputed the projected graphs for the other datasets. We used a single thread unless otherwise stated. We computed CPs based on five hypergraphs randomized as described in Section 3.2.

5.2 Q1. Comparison with Random

To answer Q1, we analyze the counts of different h-motifs' instances in real and random hypergraphs. In Table 2, we report the (approximated) count of each h-motif's instances in each real hypergraph, together with the corresponding count averaged over five random hypergraphs that are obtained as described in Section 3.2. We rank h-motifs by the counts of their instances and examine the difference between the ranks in real and corresponding random hypergraphs. As seen in the table, the count distributions in real hypergraphs are clearly distinguished from those of random hypergraphs.
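The rank comparison described above can be sketched as follows. The function names and the toy counts are hypothetical and are not values from Table 2; the sketch only illustrates ranking h-motifs by count in a real and a random hypergraph and taking the difference of the ranks.

```python
def ranks_by_count(counts):
    """Rank h-motifs by instance count (rank 1 = most frequent)."""
    order = sorted(counts, key=lambda motif: counts[motif], reverse=True)
    return {motif: rank for rank, motif in enumerate(order, start=1)}


def rank_differences(real_counts, random_counts):
    """Positive value: the h-motif ranks higher in the real hypergraph."""
    real = ranks_by_count(real_counts)
    rand = ranks_by_count(random_counts)
    return {motif: rand[motif] - real[motif] for motif in real}


# Toy counts for four hypothetical h-motif ids (not values from Table 2).
real_counts = {1: 900, 2: 50, 3: 400, 4: 10}
random_counts = {1: 100, 2: 800, 3: 300, 4: 5}
diff = rank_differences(real_counts, random_counts)
```

Comparing ranks rather than raw counts keeps the comparison meaningful even when the total numbers of instances in the real and random hypergraphs differ by orders of magnitude.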

H-motifs in Random Hypergraphs: We first notice from Table 2 that instances of h-motifs and appear much more frequently in random hypergraphs than in real hypergraphs from all domains. For example, instances of h-motif appear only about thousand times in the tags-math dataset, while they appear about million times (about more often) in the corresponding randomized hypergraphs. Additionally, in the threads-math dataset, instances of h-motif appear about thousand times, while they appear about billion times (about more often) in the corresponding randomized hypergraphs. Instances of h-motifs and consist of a hyperedge and its two disjoint subsets (see Figure 4).

H-motifs in Co-authorship Hypergraphs: We observe that instances of h-motifs , and appear more frequently in all three hypergraphs from the co-authorship domain than in the corresponding random hypergraphs. For example, while there is no single instance of h-motif in the corresponding random hypergraphs, there are about thousand such instances in the coauth-history dataset. While there are only about instances of h-motif in the corresponding random hypergraphs, there are about million such instances (about more instances) in the coauth-history dataset. As seen in Figure 4, in instances of h-motifs , , and , a hyperedge is overlapped with the two other overlapped hyperedges in three different ways.

H-motifs in Contact Hypergraphs: Instances of h-motifs , , and are noticeably more common in both contact datasets than in the corresponding random hypergraphs. As seen in Figure 4, in instances of h-motifs , and , hyperedges are tightly connected and nodes are mainly located in the intersections of all or some hyperedges.

H-motifs in Email Hypergraphs: Both email datasets contain particularly many instances of h-motifs and , compared to the corresponding random hypergraphs. As seen in Figure 4, instances of h-motifs and consist of three hyperedges where most nodes are contained in one hyperedge.

H-motifs in Tags Hypergraphs: In addition to instances of h-motif , which are common in most real hypergraphs, instances of h-motif , where all seven regions are non-empty (see Figure 4), are particularly frequent in both tags datasets compared to the corresponding random hypergraphs.

H-motifs in Threads Hypergraphs: Lastly, in both datasets from the threads domain, instances of h-motifs and are noticeably more frequent than expected from the corresponding random hypergraphs.

Figure 6: Characteristic profiles (CPs) capture local structural patterns in real-world hypergraphs accurately. The CPs are similar within domains but different across domains. Note that the significance of h-motif 3 distinguishes the contact hypergraphs from the email hypergraphs. Panels: (a) co-authorship, (b) tags, (c) contact, (d) email, (e) threads.

Figure 7: Characteristic profiles (CPs) based on hypergraph motifs (h-motifs) capture local structural patterns more accurately than CPs based on network motifs. The CPs based on h-motifs clearly distinguish the domains of the real-world hypergraphs (the average correlation coefficient is within domains and across domains), compared to the CPs based on network motifs (the average correlation coefficient is within domains and across domains). Panels: (a) Similarity Matrix from Hypergraph Motifs (Proposed), (b) Similarity Matrix from Network Motifs (Baseline).

Figure 8: MoCHy-A+ provides the best trade-off between speed and accuracy. Specifically, MoCHy-A+ produces up to more accurate estimation than MoCHy-A, and it is up to faster than MoCHy-E. The error bars indicate standard error over trials. Panels: (a) threads-ubuntu, (b) email-EU, (c) contact-primary, (d) coauth-history, (e) contact-high, (f) email-Enron.

Figure 9: Using MoCHy-A+, characteristic profiles (CPs) can be estimated accurately from a small number of samples. Panels: (a) email-EU, (b) contact-primary, (c) coauth-history.

Figure 10: Both MoCHy-E and MoCHy-A+ achieve significant speedups with multiple threads. Panels: (a) Elapsed Time, (b) Speedup.

5.3 Q2. Comparison across Domains

To answer Q2, we compare the characteristic profiles (CPs) of the real-world hypergraphs. In Figure 6, we present the CPs (i.e., the significances of the h-motifs) of each hypergraph. As seen in the figure, hypergraphs from the same domains have similar CPs. Specifically, all three hypergraphs from the co-authorship domain share extremely similar CPs, even when the absolute counts of h-motifs in them differ by several orders of magnitude. Similarly, the CPs of both hypergraphs from the tags domain are extremely similar. However, the CPs of the hypergraphs from the co-authorship domain are clearly distinguished from those of the hypergraphs from the tags domain. While the CPs of the hypergraphs from the contact domain and the CPs of those from the email domain are similar for the most part, they are distinguished by the significance of h-motif 3. These observations confirm that CPs accurately capture local structural patterns of real-world hypergraphs.

To further verify the effectiveness of CPs based on h-motifs, we compare them with CPs based on network motifs. Specifically, we represent each hypergraph as a bipartite graph in which the two types of nodes represent the nodes and hyperedges, respectively, of the original hypergraph, and each edge indicates that a node belongs to a hyperedge in the original hypergraph. Then, we compute the CPs based on the network motifs consisting of to nodes, using Motivo [14], which is a state-of-the-art approximate algorithm for network motif counting. Lastly, based on both CPs, we compute the similarity matrices (specifically, correlation coefficient matrices) of the real-world hypergraphs. As seen in Figure 7, the domains of the real-world hypergraphs are distinguished more clearly by the CPs based on h-motifs than by the CPs based on network motifs. Numerically, when the CPs based on h-motifs are used, the average correlation coefficient is within domains and across domains, and thus the gap is . However, when the CPs based on network motifs are used, the average correlation coefficient is within domains and across domains, and thus the gap is just . These results support that h-motifs play a key role in capturing the local structural patterns of real-world hypergraphs.
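The two steps of this comparison, the bipartite representation and the within-domain versus across-domain correlation gap, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the profiles are toy vectors rather than CPs from the paper, and the Pearson coefficient is assumed as the correlation measure.

```python
from itertools import combinations
from math import sqrt


def bipartite_edges(hyperedges):
    """One edge per (node, hyperedge) membership; hyperedge i becomes ("e", i)."""
    return [(node, ("e", i))
            for i, edge in enumerate(hyperedges)
            for node in sorted(edge)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def domain_averages(profiles, domains):
    """Average correlation within domains and across domains."""
    within, across = [], []
    for a, b in combinations(profiles, 2):
        r = pearson(profiles[a], profiles[b])
        (within if domains[a] == domains[b] else across).append(r)
    return sum(within) / len(within), sum(across) / len(across)


# Toy profiles for four hypothetical hypergraphs from two hypothetical domains.
profiles = {"A1": [1, 2, 3], "A2": [2, 4, 6], "B1": [3, 2, 1], "B2": [6, 4, 2]}
domains = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
within, across = domain_averages(profiles, domains)
gap = within - across
```

A larger gap between the two averages indicates that the profiles separate the domains more clearly, which is the criterion used to compare the two kinds of CPs.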

5.4 Q3. Performance of Counting Algorithms

To answer Q3, we test the speed and accuracy of all versions of MoCHy under various settings. To this end, we measure elapsed time and relative error defined as

for MoCHy-A and MoCHy-A+, respectively.

Speed and Accuracy: In Figure 8, we report the elapsed time and relative error of all versions of MoCHy on the different datasets where MoCHy-E terminates within a reasonable time. The numbers of samples in MoCHy-A and MoCHy-A+ are set to percent of the counts of hyperedges and hyperwedges, respectively. As seen in the figure, MoCHy-A+ provides the best trade-off between speed and accuracy. For example, in the threads-ubuntu dataset, MoCHy-A+ provides lower relative error than MoCHy-A, consistent with our theoretical analysis (see the last paragraph of Section 4.3). Moreover, in the same dataset, MoCHy-A+ is faster than MoCHy-E with little sacrifice in accuracy.

Effects of the Sample Size on CPs: In Figure 9, we report the CPs obtained by MoCHy-A+ with different numbers of hyperwedge samples on three datasets. Even with a small number of samples, the CPs are estimated nearly perfectly.

Parallelization: In Figure 10, we measure the running times of MoCHy-E and MoCHy-A+ with different numbers of threads on the threads-ubuntu dataset. As seen in the figure, both algorithms achieve significant speedups with multiple threads. Specifically, with threads, MoCHy-E and MoCHy-A+ () achieve speedups of and , respectively.

Effects of On-the-fly Computation on Speed: We analyze the effects of the on-the-fly computation of projected graphs (discussed in Section 4.4) on the speed of MoCHy-A+ under different memory budgets for memoization. To this end, we use the threads-ubuntu dataset, and we set the memory budgets so that up to of the edges in the projected graph can be memoized. The results are shown in Figure 11. As the memory budget increases, MoCHy-A+ becomes faster, avoiding repeated computation. Due to our careful prioritization scheme based on degree, by memoizing only of the edges, MoCHy-A+ achieves speedups of about .

6 Conclusions

In this work, we introduce hypergraph motifs (h-motifs), and using them, we investigate the local structures of real-world hypergraphs from different domains. We summarize our contributions as follows:

  • Novel Concepts: We define 26 h-motifs, which describe connectivity patterns of three connected hyperedges in a unique and exhaustive way, independently of the sizes of hyperedges (Figure 4).

  • Fast and Provable Algorithms: We propose parallel algorithms for (approximately) counting every h-motif's instances, and we theoretically and empirically analyze their speed and accuracy. Both approximate algorithms yield unbiased estimates (Theorems 2 and 4), and the advanced one in particular is up to 32X faster than the exact algorithm, with little sacrifice in accuracy (Figure 8).

  • Discoveries in Real-world Hypergraphs: We confirm the efficacy of h-motifs by showing that local structural patterns captured by them are similar within domains but different across domains (Figures 6 and 7).

Figure 11: Memoizing a small fraction of projected graphs leads to significant speedups of MoCHy-A+. Panels: (a) Elapsed Time, (b) Speedup.

Future research directions include (1) extending h-motifs to richer hypergraphs, such as heterogeneous or temporal hypergraphs, and (2) incorporating h-motifs into various tasks, such as hypergraph embedding, ranking, and clustering.

Appendix A Proof of Theorem 2

For each sampled hyperedge (in line 3 of Algorithm 3) and each instance of the h-motif under consideration, we define an indicator random variable for whether the sampled hyperedge is included in the instance. That is, the variable equals 1 if the hyperedge is included in the instance, and 0 otherwise. We also consider the number of times that the h-motif's instances are counted while processing the sampled hyperedges. That is,

(9)

Then, by lines 3-3 of Algorithm 3,

(10)
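The hyperedge-sampling estimator analyzed in this appendix can be checked numerically on a toy example. The sketch below is an illustration under assumptions rather than the authors' algorithm: it counts all connected triples of hyperedges without distinguishing h-motif types, it assumes that hyperedges are sampled uniformly at random, and it assumes that the raw count is rescaled by the number of hyperedges divided by three times the number of samples. All names and the toy hypergraph are hypothetical.

```python
from itertools import combinations


def connected_triples(hyperedges):
    """Triples of hyperedges in which at least two of the three pairs overlap."""
    triples = []
    for a, b, c in combinations(range(len(hyperedges)), 3):
        overlaps = sum(bool(hyperedges[x] & hyperedges[y])
                       for x, y in ((a, b), (b, c), (a, c)))
        if overlaps >= 2:
            triples.append((a, b, c))
    return triples


def estimate_from_sample(sampled, hyperedges, triples):
    """Rescale the number of (sampled hyperedge, instance) containments."""
    raw = sum(1 for e in sampled for t in triples if e in t)
    # Assumed scaling: each instance contains three hyperedges.
    return raw * len(hyperedges) / (3 * len(sampled))


# Toy hypergraph (hypothetical): five hyperedges overlapping in a cycle.
hyperedges = [{1, 2}, {2, 3}, {3, 4}, {4, 5}, {1, 5, 6}]
triples = connected_triples(hyperedges)
exact = len(triples)

# Expected value of a one-sample estimate under uniform sampling,
# obtained by averaging over every possible sampled hyperedge.
expected = sum(estimate_from_sample([e], hyperedges, triples)
               for e in range(len(hyperedges))) / len(hyperedges)
```

Averaging over every possible sampled hyperedge gives the exact expectation of a one-sample estimate, and in this toy example it coincides with the exact count, which is the unbiasedness property that the proof establishes in general.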

Proof of the Bias of (Eq. (4)):

Proof.

Since each h-motif instance contains three hyperedges, the probability that each sampled hyperedge is contained in each instance of the h-motif is

(11)

From linearity of expectation,