1 Introduction
Complex systems consisting of pairwise interactions between individuals or objects are naturally expressed in the form of graphs. Nodes and edges, which compose a graph, represent individuals (or objects) and their pairwise interactions, respectively. Thanks to their powerful expressiveness, graphs have been used in a wide variety of fields, including social network analysis, the web, bioinformatics, and epidemiology. Global structural patterns of real-world graphs, such as power-law degree distributions [8, 18] and six degrees of separation [28, 60], have been extensively investigated.
In addition to global patterns, real-world graphs exhibit patterns in their local structures, which differentiate graphs in the same domain from random graphs or those in other domains. Local structures are revealed by counting the occurrences of different network motifs [41, 42], which describe the connectivity pattern of pairwise interactions between a fixed, small number of connected nodes. As a fundamental building block, network motifs have played a key role in many analytical and predictive tasks, including community detection [11, 39, 57, 62], classification [16, 34, 41], and anomaly detection [9, 52].
Despite the prevalence of graphs, interactions in many complex systems are groupwise rather than pairwise: collaborations of researchers, co-purchases of items, joint interactions of proteins, and tags attached to the same web post, to name a few. These group interactions cannot be represented by edges in a graph. Suppose three or more researchers coauthor a publication. This coauthorship cannot be represented as a single edge, and if we instead create edges between all pairs of the researchers, the result cannot be distinguished from multiple papers coauthored by subsets of the researchers.
This inherent limitation of graphs is addressed by hypergraphs, which consist of nodes and hyperedges. Each hyperedge is a subset of any number of nodes, and it represents a group interaction among the nodes. For example, in a hypergraph, a paper coauthored by three researchers $A$, $B$, and $C$ is expressed as a hyperedge $\{A, B, C\}$, and it is distinguished from three papers coauthored by each pair, which are represented as three hyperedges $\{A, B\}$, $\{B, C\}$, and $\{A, C\}$.


The successful investigation and discovery of local structural patterns in realworld graphs motivates us to explore local structural patterns in realworld hypergraphs. However, network motifs, which proved to be useful for graphs, are not trivially extended to hypergraphs. Specifically, due to the flexibility in the size of hyperedges, there can be infinitely many connectivity patterns of interactions among a fixed number of nodes, and other nodes can also be associated with these interactions.
In this work, taking these challenges into consideration, we define hypergraph motifs (hmotifs) so that they describe connectivity patterns of three connected hyperedges (rather than nodes). As seen in Figure 1, hmotifs describe the connectivity pattern of hyperedges $e_i$, $e_j$, and $e_k$ by the emptiness of seven subsets: $e_i \setminus e_j \setminus e_k$, $e_j \setminus e_k \setminus e_i$, $e_k \setminus e_i \setminus e_j$, $e_i \cap e_j \setminus e_k$, $e_j \cap e_k \setminus e_i$, $e_k \cap e_i \setminus e_j$, and $e_i \cap e_j \cap e_k$. As a result, every connectivity pattern is described by a unique hmotif, independently of the sizes of hyperedges. While this work focuses on connectivity patterns of three hyperedges, hmotifs are easily extended to four or more hyperedges.
We count the number of each hmotif’s instances in realworld hypergraphs from different domains. Then, we measure the significance of each hmotif in each hypergraph by comparing the count of its instances in the hypergraph against the counts in properly randomized hypergraphs. Lastly, we compute the characteristic profile (CP) of each hypergraph, defined as the vector of the normalized significance of every hmotif. Comparing the counts and CPs of different hypergraphs leads to the following observations:


Structural design principles of realworld hypergraphs that are captured by frequencies of different hmotifs are clearly distinguished from those of randomized hypergraphs.

Hypergraphs from the same domains have similar CPs, while hypergraphs from different domains have distinct CPs (see Figure 2). In other words, CPs successfully capture local structural patterns unique to each domain.
Our algorithmic contribution is to design MoCHy (Motif Counting in Hypergraphs), a family of parallel algorithms for counting hmotifs’ instances, which is the computational bottleneck of the above process. Note that since non-pairwise interactions (e.g., the intersection of three hyperedges) are taken into consideration, counting the instances of hmotifs is more challenging than counting the instances of network motifs, which are defined solely based on pairwise interactions. We provide one exact version, named MoCHyE, and two approximate versions, named MoCHyA and MoCHyA^{+}. Empirically, MoCHyA^{+} is more accurate than MoCHyA, and it is faster than MoCHyE with little sacrifice of accuracy. These empirical results are consistent with our theoretical analysis of their speed, bias, and variance.
In summary, our contributions are as follows:


Novel Concepts: We propose hmotifs, the counts of whose instances capture local structures of hypergraphs, independently of the sizes of hyperedges or hypergraphs.

Fast and Provable Algorithms: We develop MoCHy, a family of parallel algorithms for counting hmotifs’ instances. We show theoretically and empirically that the advanced version significantly outperforms the basic ones, providing a better tradeoff between speed and accuracy.

Discoveries in Realworld Hypergraphs: We show that hmotifs and CPs reveal local structural patterns that are shared by hypergraphs from the same domains but distinguished from those of random hypergraphs and hypergraphs from other domains (see Figure 2).
In Section 2, we discuss related works. In Section 3, we introduce hmotifs and characteristic profiles. In Section 4, we present exact and approximate algorithms for counting instances of hmotifs, and we analyze their theoretical properties. After providing experimental results in Section 5, we offer conclusions in Section 6.
2 Related Work
In this section, we review previous work on network motifs, algorithms for network motif counting, and hypergraphs. While the definition of a network motif varies among studies, here we define it as a connected graph composed of a predefined number of nodes.
Network Motifs. Network motifs were proposed as a tool for understanding the underlying design principles and capturing the local structural patterns of graphs [19, 50, 42]. The occurrences of motifs in real-world graphs are significantly different from those in random graphs [42] and also vary depending on the domains of graphs [41]. The concept of network motifs has been extended to various types of graphs, including dynamic [45], bipartite [13], and heterogeneous [47] graphs. The occurrences of network motifs have been used in a wide range of graph applications: community detection [11, 62, 39, 57], ranking [67], graph embedding [48, 65], and graph neural networks [34], to name a few.
Algorithms for Network Motif Counting. Due to these wide applications, numerous algorithms have been proposed for rapid and accurate counting of the occurrences of motifs in large graphs. Some of them focus on counting the occurrences of a particular motif, such as the triangle (i.e., the clique of three nodes) [2, 17, 20, 21, 26, 31, 33, 44, 51, 52, 56, 58, 59], the butterfly (i.e., a biclique) [49], and cliques of a given number of nodes [25]. Others count the occurrences of every motif of a fixed size [3, 4, 7, 14, 46]. Many of these algorithms employ sampling to estimate the counts [2, 7, 14, 17, 26, 44, 49, 51, 52, 56]. Note that these previous approaches for counting the occurrences of network motifs are not directly applicable to the problem of counting the occurrences of hmotifs. This is because, different from network motifs, which are defined solely based on pairwise interactions, hmotifs are defined based on non-pairwise interactions (see Section 3.2).

Table 1: Frequently used notations.
Notation  Definition
$G = (V, E)$  hypergraph with nodes $V$ and hyperedges $E$
$E = \{e_1, \dots, e_{|E|}\}$  set of hyperedges
$E_v$  set of hyperedges that contain a node $v$
$\wedge$  set of hyperwedges in $G$
$\wedge_{ij}$  hyperwedge consisting of $e_i$ and $e_j$
$\bar{G} = (E, \wedge, \omega)$  projected graph of $G$
$\omega(\wedge_{ij})$  the number of nodes shared between $e_i$ and $e_j$
$N_{e_i}$  set of neighbors of $e_i$ in $\bar{G}$
$h(\{e_i, e_j, e_k\})$  hmotif corresponding to an instance $\{e_i, e_j, e_k\}$
$M[t]$  count of hmotif $t$’s instances

Hypergraphs. Hypergraphs naturally represent group interactions, and they have been identified as a useful tool in a wide range of fields, including computer vision [23, 22, 64], bioinformatics [24], circuit design [29, 43], social network analysis [61, 36], and recommender systems [15, 37]. There also has been considerable attention on machine learning on hypergraphs, including clustering [1, 6, 30, 38, 68], classification [27, 55, 64], and hyperedge prediction [10, 63, 66]. Recently, empirical studies on real-world hypergraphs have revealed several structural and temporal patterns [10, 12]. They focus on simplicial closure (i.e., the emergence of the first hyperedge that includes a set of nodes each of whose pairs co-appear in previous hyperedges) [10] and repetition of the same hyperedges and their subsets [12].
3 Proposed Concepts
In this section, we introduce the proposed concepts: hypergraph motifs and characteristic profiles. Refer to Table 1 for the notations frequently used throughout the paper.
3.1 Preliminaries and Notations
We define some preliminary concepts and their notations.
Hypergraph: Consider a hypergraph $G = (V, E)$, where $V$ and $E := \{e_1, e_2, \dots, e_{|E|}\}$ are sets of nodes and hyperedges, respectively. Each hyperedge $e_i \in E$ is a non-empty subset of $V$, and we use $|e_i|$ to denote the number of nodes in it. For each node $v \in V$, we use $E_v := \{e_i \in E : v \in e_i\}$ to denote the set of hyperedges that include $v$. We say two hyperedges $e_i$ and $e_j$ are adjacent if they share any member, i.e., if $e_i \cap e_j \neq \emptyset$. Then, for each hyperedge $e_i$, we denote the set of hyperedges adjacent to $e_i$ as $N_{e_i}$ and the number of such hyperedges as $|N_{e_i}|$. Similarly, we say three hyperedges $e_i$, $e_j$, and $e_k$ are connected if one of them is adjacent to the two others.
Hyperwedges: We define a hyperwedge as an unordered pair of adjacent hyperedges. We denote the set of hyperwedges in $G$ by $\wedge$. We use $\wedge_{ij} \in \wedge$ to denote the hyperwedge consisting of $e_i$ and $e_j$. In the example hypergraph in Figure 3(a), there are six hyperwedges.
Projected Graph: We define the projected graph of $G = (V, E)$ by $\bar{G} = (E, \wedge, \omega)$, where $\wedge$ is the set of hyperwedges and $\omega(\wedge_{ij}) := |e_i \cap e_j|$. That is, in the projected graph $\bar{G}$, hyperedges in $G$ act as nodes, and two of them are adjacent if and only if they share any member. Note that for each hyperedge $e_i \in E$, $N_{e_i}$ is the set of neighbors of $e_i$ in $\bar{G}$, and $|N_{e_i}|$ is its degree in $\bar{G}$. Figure 3(b) shows the projected graph of the example hypergraph in Figure 3(a).
3.2 Hypergraph Motifs
We introduce hypergraph motifs, which are basic building blocks of hypergraphs, with related concepts. Then, we discuss their properties and generalization.
Definition and Representation: Hypergraph motifs (or hmotifs for short) describe the connectivity patterns of three connected hyperedges. Specifically, given a set $\{e_i, e_j, e_k\}$ of three connected hyperedges, hmotifs describe its connectivity pattern by the emptiness of the following seven sets: (1) $e_i \setminus e_j \setminus e_k$, (2) $e_j \setminus e_k \setminus e_i$, (3) $e_k \setminus e_i \setminus e_j$, (4) $e_i \cap e_j \setminus e_k$, (5) $e_j \cap e_k \setminus e_i$, (6) $e_k \cap e_i \setminus e_j$, and (7) $e_i \cap e_j \cap e_k$. Formally, an hmotif is defined as a binary vector of size 7 whose elements represent the emptiness of the above sets, respectively, and as seen in Figure 1, hmotifs are naturally represented in Venn diagrams. While there can be $2^7$ hmotifs, 26 hmotifs remain once we exclude symmetric ones, those with duplicated hyperedges (see Figure 5), and those that cannot be obtained from connected hyperedges. The 26 cases, which we call hmotif 1 through hmotif 26, are visualized in Venn diagrams in Figure 4.
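The count of 26 can be checked by brute force. The following is a minimal sketch, not the authors' code: it assumes the seven regions are ordered as in the list above, encodes a non-empty region as 1, and treats two patterns as the same hmotif when one is obtained from the other by reordering the three hyperedges. The helper names (`relabelings`, `is_valid`, `is_closed`) are hypothetical.

```python
from itertools import permutations, product

# Region order: (i only, j only, k only, i&j only, j&k only, k&i only, i&j&k).
# A value of 1 means the region is non-empty.

def relabelings(p):
    """Yield pattern p as seen under every ordering of the three hyperedges."""
    single = {0: p[0], 1: p[1], 2: p[2]}
    pair = {frozenset((0, 1)): p[3], frozenset((1, 2)): p[4], frozenset((2, 0)): p[5]}
    for x, y, z in permutations(range(3)):
        yield (single[x], single[y], single[z],
               pair[frozenset((x, y))], pair[frozenset((y, z))], pair[frozenset((z, x))],
               p[6])

def is_valid(p):
    a, b, c, d, e, f, g = p
    adjacent_pairs = sum(1 for shared in (d or g, e or g, f or g) if shared)
    connected = adjacent_pairs >= 2          # one hyperedge touches the other two
    distinct = ((a or f or b or e)           # e_i differs from e_j
                and (b or d or c or f)       # e_j differs from e_k
                and (a or d or c or e))      # e_k differs from e_i
    return connected and bool(distinct)

def is_closed(p):
    d, e, f, g = p[3:]
    return all((d or g, e or g, f or g))     # every pair of hyperedges overlaps

# Keep one representative (the smallest relabeling) per symmetry class.
motifs = {min(relabelings(p)) for p in product((0, 1), repeat=7) if is_valid(p)}
closed = [p for p in motifs if is_closed(p)]
print(len(motifs), len(closed), len(motifs) - len(closed))
```

Under these assumptions the enumeration yields 26 patterns, of which 20 are closed and 6 are open. The representatives are not numbered here, since the numbering of hmotifs 1 through 26 is fixed by Figure 4.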
Instances, Open hmotifs, and Closed hmotifs: Consider a hypergraph $G = (V, E)$. A set of three connected hyperedges is an instance of hmotif $t$ if their connectivity pattern corresponds to hmotif $t$. The count of each hmotif’s instances is used to characterize the local structure of $G$, as discussed in the following sections. An hmotif is closed if all three hyperedges in its instances are adjacent to (i.e., overlap with) each other. If its instances contain two non-adjacent (i.e., disjoint) hyperedges, an hmotif is open. Figure 4 indicates which hmotifs are open; the others are closed.
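To make the notions of an instance and of open versus closed hmotifs concrete, here is a small sketch that computes the seven-region pattern of three hyperedges given as Python sets. The function names are hypothetical, `True` marks a non-empty region, and the input is assumed to be three distinct, connected hyperedges.

```python
def hmotif_pattern(e_i, e_j, e_k):
    """Return which of the seven Venn-diagram regions are non-empty."""
    e_i, e_j, e_k = set(e_i), set(e_j), set(e_k)
    return (
        bool(e_i - e_j - e_k),      # (1) only in e_i
        bool(e_j - e_k - e_i),      # (2) only in e_j
        bool(e_k - e_i - e_j),      # (3) only in e_k
        bool((e_i & e_j) - e_k),    # (4) in e_i and e_j but not e_k
        bool((e_j & e_k) - e_i),    # (5) in e_j and e_k but not e_i
        bool((e_k & e_i) - e_j),    # (6) in e_k and e_i but not e_j
        bool(e_i & e_j & e_k),      # (7) in all three
    )

def is_open(e_i, e_j, e_k):
    """An instance is open if some pair of its hyperedges is disjoint."""
    e_i, e_j, e_k = set(e_i), set(e_j), set(e_k)
    return not (e_i & e_j) or not (e_j & e_k) or not (e_k & e_i)

# Three papers: the first and third share no author, so the instance is open.
papers = ({"ann", "bob", "cho"}, {"cho", "dee"}, {"dee", "eve"})
pattern = hmotif_pattern(*papers)
```

Note that the pattern depends only on which regions are empty, not on how many nodes each region holds, which is the size-independence property discussed in this section.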
Properties of hmotifs: From the definition of hmotifs, the following desirable properties, which are discussed in Section 1, are immediate:


Exhaustive: hmotifs capture connectivity patterns of all possible three connected hyperedges.

Unique: the connectivity pattern of any three connected hyperedges is captured by exactly one hmotif.

Size Independent: hmotifs capture connectivity patterns independently of the sizes of hyperedges. Note that there can be infinitely many combinations of sizes of three connected hyperedges.
Generalization to Four or More Hyperedges: The concept of hmotifs is easily generalized to four or more hyperedges. For example, an hmotif for four hyperedges can be defined as a binary vector of size $2^4 - 1 = 15$ indicating the emptiness of each region in the Venn diagram for four sets. We leave this generalization as future work and focus on the hmotifs for three hyperedges since they are already capable of characterizing local structures of real-world hypergraphs, as shown empirically in Section 5.
3.3 Characteristic Profile (CP)
What are the structural design principles of realworld hypergraphs distinguished from those of random hypergraphs? Below, we introduce the characteristic profile (CP), which is a tool for answering the above question using hmotifs.
Randomized Hypergraphs: While one might try to characterize the local structure of a hypergraph by the absolute counts of each hmotif’s instances in it, some hmotifs may naturally have many instances. Thus, for more accurate characterization, we need random hypergraphs to be compared against real-world hypergraphs. We obtain such random hypergraphs by randomizing a compared real-world hypergraph. To this end, we represent the hypergraph $G = (V, E)$ as a bipartite graph $G'$ where $V$ and $E$ are the two partitions of nodes, and there exists an edge between $v \in V$ and $e \in E$ if and only if $v \in e$. Then, we use the Chung-Lu bipartite graph generative model, which successfully preserves the degree distribution [5]. As a result, we obtain randomized hypergraphs in which the degree distribution of nodes (i.e., the distribution of the number of hyperedges that each node belongs to) and the size distribution of hyperedges in $G$ are maintained.
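The randomization step can be sketched as follows. This is an illustrative Chung-Lu-style sampler under stated assumptions, not the implementation used in the paper or in [5]: it draws as many node-hyperedge incidences as the original hypergraph has, picking the node proportionally to its degree and the hyperedge proportionally to its size. Repeated incidences collapse, so degrees and sizes are matched only approximately. Dropping empty and duplicated hyperedges at the end is also an assumption of this sketch, and `chung_lu_randomize` is a hypothetical name.

```python
import random

def chung_lu_randomize(hyperedges, seed=None):
    """Return a randomized hypergraph with roughly the same degrees and sizes."""
    rng = random.Random(seed)
    # One list entry per incidence: a uniform draw from node_stubs picks a node
    # proportionally to its degree, and a uniform draw from edge_stubs picks a
    # hyperedge proportionally to its size.
    node_stubs = [v for e in hyperedges for v in e]
    edge_stubs = [idx for idx, e in enumerate(hyperedges) for _ in e]
    members = [set() for _ in hyperedges]
    for _ in range(len(node_stubs)):
        members[rng.choice(edge_stubs)].add(rng.choice(node_stubs))
    # Assumption of this sketch: discard empty and duplicated hyperedges.
    return list({frozenset(m) for m in members if m})

original = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
randomized = chung_lu_randomize(original, seed=0)
```

Averaging hmotif counts over several such randomized hypergraphs gives the baseline against which the real counts are compared.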
Significance of Hmotifs: We measure the significance of each hmotif in a hypergraph by comparing the count of its instances against the corresponding count in random hypergraphs. Specifically, the significance of an hmotif $t$ in a hypergraph $G$ is defined as
$\Delta_t := \frac{M[t] - M_{rand}[t]}{M[t] + M_{rand}[t] + \epsilon}$,   (1)
where $M[t]$ is the number of instances of hmotif $t$ in $G$, $M_{rand}[t]$ is the average number of instances of hmotif $t$ in randomized hypergraphs obtained as described above, and $\epsilon$ is a constant that we keep fixed throughout this paper. This way of measuring significance was proposed in [42] for network motifs as an alternative to normalized Z-scores, which heavily depend on the graph size.
Characteristic Profile (CP): By normalizing and concatenating the significances of all hmotifs in a hypergraph, we obtain the characteristic profile (CP), which summarizes the local structural pattern in the hypergraph. Specifically, the characteristic profile of a hypergraph $G$ is a vector of size 26, where the $t$-th element is
$CP_t := \frac{\Delta_t}{\sqrt{\sum_{t'=1}^{26} \Delta_{t'}^2}}$.   (2)
Note that, for each $t$, $CP_t$ is between $-1$ and $1$. The CP is used in Section 5.3 to compare the local structural patterns of real-world hypergraphs from diverse domains.
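As a worked illustration of the two steps above, the sketch below computes significances and a characteristic profile from two lists of counts. It assumes the significance takes the form proposed for network motifs in [42], namely (real - random) / (real + random + eps), followed by normalization to unit Euclidean length. The default `eps=1.0` is only a placeholder, and the counts in the example are made up.

```python
import math

def significance(real_count, random_avg, eps=1.0):
    """Relative excess of the real count over the randomized average."""
    return (real_count - random_avg) / (real_count + random_avg + eps)

def characteristic_profile(real_counts, random_avgs, eps=1.0):
    """Normalize the vector of significances to unit Euclidean length."""
    deltas = [significance(m, r, eps) for m, r in zip(real_counts, random_avgs)]
    norm = math.sqrt(sum(d * d for d in deltas))
    return [d / norm if norm > 0 else 0.0 for d in deltas]

# Made-up counts for three hmotifs, for illustration only.
cp = characteristic_profile([120, 5, 40], [30.0, 50.0, 40.0])
```

Because every entry is divided by the norm of the whole vector, each entry lies in [-1, 1], and hypergraphs of very different sizes become comparable.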
4 Proposed Algorithms
Given a hypergraph, how can we count the instances of each hmotif? Once we count them in the original and randomized hypergraphs, the significance of each motif and the CP are obtained immediately by Eq. (1) and Eq. (2).
The problem of counting hmotifs’ instances bears some similarity to the classic problem of counting network motifs’ instances. However, different from network motifs, which are defined solely based on pairwise interactions, hmotifs are defined based on non-pairwise interactions (e.g., $e_i \cap e_j \cap e_k$). Due to this difference, new approaches are required.
In this section, we present MoCHy (Motif Counting in Hypergraphs), which is a family of parallel algorithms for counting the instances of each hmotif in the input hypergraph. We first describe hypergraph projection, which is a preprocessing step of every version of MoCHy. Then, we present MoCHyE, which is for exact counting. After that, we present two different versions of MoCHyA, which are samplingbased algorithms for approximate counting. Lastly, we discuss parallel and onthefly implementations.
Throughout this section, we use $h(\{e_i, e_j, e_k\})$ to denote the hmotif that describes the connectivity pattern of an hmotif instance $\{e_i, e_j, e_k\}$. We also use $M[t]$ to denote the count of instances of hmotif $t$.
4.1 Hypergraph Projection (Algorithm 1)
As a preprocessing step, every version of MoCHy builds the projected graph $\bar{G} = (E, \wedge, \omega)$ (see Section 3.1) of the input hypergraph $G = (V, E)$, as described in Algorithm 1. To find the neighbors of each hyperedge $e_i$, the algorithm visits, for each node $v \in e_i$, each hyperedge $e_j$ that contains $v$ and satisfies $j > i$. Then, for each such $e_j$, it adds $\wedge_{ij}$ to $\wedge$ and increments $\omega(\wedge_{ij})$. The time complexity of this preprocessing step is given in Lemma 1.
Lemma 1 (Complexity of Hypergraph Projection).
The time complexity of Algorithm 1 is $O(\sum_{\wedge_{ij} \in \wedge} |e_i \cap e_j|)$.
Proof.
Since and , Eq. (3) holds.
(3) 
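The projection step described in Section 4.1 can be sketched as follows. This is an illustrative Python version under the assumption that hyperedges are given as a list of sets indexed from 0. It is not the authors' C++ implementation, and `project` is a hypothetical name.

```python
from collections import defaultdict

def project(hyperedges):
    """Build the projected graph: neighbor sets and overlap sizes (omega)."""
    node_to_edges = defaultdict(list)          # plays the role of E_v
    for idx, e in enumerate(hyperedges):
        for v in e:
            node_to_edges[v].append(idx)

    omega = defaultdict(int)                   # omega[(i, j)] = |e_i & e_j| for i < j
    neighbors = defaultdict(set)               # neighbors[i] plays the role of N_{e_i}
    for i, e in enumerate(hyperedges):
        for v in e:
            for j in node_to_edges[v]:
                if j > i:                      # visit each hyperwedge from one side only
                    omega[(i, j)] += 1
                    neighbors[i].add(j)
                    neighbors[j].add(i)
    return dict(neighbors), dict(omega)

neighbors, omega = project([{"a", "b", "c"}, {"b", "c", "d"}, {"d", "e"}, {"f"}])
```

The innermost statement runs once per shared node of each hyperwedge, which matches the intuition that the cost is driven by the total size of the pairwise intersections.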
4.2 Exact Hmotif Counting (Algorithm 2)
We present MoCHyE (MoCHy Exact), which exactly counts the instances of each hmotif. The procedures of MoCHyE are described in Algorithm 2. For each hyperedge $e_i$, each unordered pair $\{e_j, e_k\}$ of its neighbors, which forms an hmotif instance $\{e_i, e_j, e_k\}$, is considered. If $e_j \cap e_k = \emptyset$ (i.e., if the corresponding hmotif is open), $\{e_i, e_j, e_k\}$ is considered only once. However, if $e_j \cap e_k \neq \emptyset$ (i.e., if the corresponding hmotif is closed), $\{e_i, e_j, e_k\}$ is considered two more times (i.e., when $e_j$ is chosen as the first hyperedge and when $e_k$ is chosen as the first hyperedge). Based on these observations, given an hmotif instance $\{e_i, e_j, e_k\}$, the corresponding count $M[h(\{e_i, e_j, e_k\})]$ is incremented only if $e_j \cap e_k = \emptyset$ or $i < \min(j, k)$. This guarantees that each instance is counted exactly once. The time complexity of MoCHyE is given in Theorem 1, which is based on Lemma 2.
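The counting rule above can be illustrated with a compact sketch. It is written for readability rather than speed, assumes duplicated hyperedges have been removed, and identifies each hmotif by a canonical seven-region pattern (the smallest pattern over all orderings of the three hyperedges) instead of the numbering of Figure 4. The names `canonical` and `count_exact` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations, permutations

def canonical(e_i, e_j, e_k):
    """Seven-region pattern (True = non-empty), minimized over hyperedge orderings."""
    def pattern(a, b, c):
        return (bool(a - b - c), bool(b - c - a), bool(c - a - b),
                bool((a & b) - c), bool((b & c) - a), bool((c & a) - b),
                bool(a & b & c))
    return min(pattern(*p) for p in permutations((e_i, e_j, e_k)))

def count_exact(hyperedges):
    E = [frozenset(e) for e in hyperedges]
    by_node = defaultdict(list)
    for idx, e in enumerate(E):
        for v in e:
            by_node[v].append(idx)
    nbrs = [set() for _ in E]                  # neighbors in the projected graph
    for ids in by_node.values():
        for i, j in combinations(ids, 2):
            nbrs[i].add(j)
            nbrs[j].add(i)

    counts = Counter()
    for i in range(len(E)):
        for j, k in combinations(sorted(nbrs[i]), 2):
            is_open = not (E[j] & E[k])
            # Open instances are seen only from their middle hyperedge; closed
            # instances are seen three times, so keep the visit with the smallest i.
            if is_open or i < min(j, k):
                counts[canonical(E[i], E[j], E[k])] += 1
    return counts

counts = count_exact([{1, 2}, {2, 3}, {1, 3}, {3, 4}])
```

In this toy hypergraph there are four instances in total: two closed ones and two open ones.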
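Lemma 2 (Time Complexity of Computing $h(\{e_i, e_j, e_k\})$).
Given the input hypergraph $G = (V, E)$ and its projected graph $\bar{G} = (E, \wedge, \omega)$, for each hmotif instance $\{e_i, e_j, e_k\}$, computing $h(\{e_i, e_j, e_k\})$ takes $O(\min(|e_i|, |e_j|, |e_k|))$ time.
Proof.
Assume $|e_i| = \min(|e_i|, |e_j|, |e_k|)$, without loss of generality, and that all sets and maps are implemented using hash tables. As defined in Section 3.2, $h(\{e_i, e_j, e_k\})$ is computed in $O(1)$ time from the emptiness of the following sets: (1) $e_i \setminus e_j \setminus e_k$, (2) $e_j \setminus e_k \setminus e_i$, (3) $e_k \setminus e_i \setminus e_j$, (4) $e_i \cap e_j \setminus e_k$, (5) $e_j \cap e_k \setminus e_i$, (6) $e_k \cap e_i \setminus e_j$, and (7) $e_i \cap e_j \cap e_k$. We check their emptiness from their cardinalities. We obtain $e_i$, $e_j$, and $e_k$, which are stored in $G$, and their cardinalities in $O(1)$ time. Similarly, we obtain $|e_i \cap e_j|$, $|e_j \cap e_k|$, and $|e_k \cap e_i|$, which are stored in $\bar{G}$, in $O(1)$ time. Then, we compute $|e_i \cap e_j \cap e_k|$ in $O(|e_i|)$ time by checking for each node in $e_i$ whether it is also in both $e_j$ and $e_k$. From these cardinalities, we obtain the cardinalities of the six other sets in $O(1)$ time as follows:
(1) $|e_i \setminus e_j \setminus e_k| = |e_i| - |e_i \cap e_j| - |e_k \cap e_i| + |e_i \cap e_j \cap e_k|$,
(2) $|e_j \setminus e_k \setminus e_i| = |e_j| - |e_i \cap e_j| - |e_j \cap e_k| + |e_i \cap e_j \cap e_k|$,
(3) $|e_k \setminus e_i \setminus e_j| = |e_k| - |e_j \cap e_k| - |e_k \cap e_i| + |e_i \cap e_j \cap e_k|$,
(4) $|e_i \cap e_j \setminus e_k| = |e_i \cap e_j| - |e_i \cap e_j \cap e_k|$,
(5) $|e_j \cap e_k \setminus e_i| = |e_j \cap e_k| - |e_i \cap e_j \cap e_k|$,
(6) $|e_k \cap e_i \setminus e_j| = |e_k \cap e_i| - |e_i \cap e_j \cap e_k|$.
Hence, the time complexity of computing $h(\{e_i, e_j, e_k\})$ is $O(|e_i|) = O(\min(|e_i|, |e_j|, |e_k|))$. ∎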
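The bookkeeping in this proof is plain inclusion-exclusion, which the following sketch spells out. The function names are hypothetical. In an actual implementation the pairwise overlaps would be looked up in the projected graph rather than passed in by hand.

```python
def triple_overlap(e_i, e_j, e_k):
    """Size of the three-way intersection, scanning only the smallest hyperedge."""
    smallest, *others = sorted((e_i, e_j, e_k), key=len)
    return sum(1 for v in smallest if all(v in other for other in others))

def region_sizes(n_i, n_j, n_k, n_ij, n_jk, n_ki, n_ijk):
    """Cardinalities of the seven Venn-diagram regions via inclusion-exclusion."""
    return (
        n_i - n_ij - n_ki + n_ijk,   # (1) only in e_i
        n_j - n_ij - n_jk + n_ijk,   # (2) only in e_j
        n_k - n_jk - n_ki + n_ijk,   # (3) only in e_k
        n_ij - n_ijk,                # (4) in e_i and e_j only
        n_jk - n_ijk,                # (5) in e_j and e_k only
        n_ki - n_ijk,                # (6) in e_k and e_i only
        n_ijk,                       # (7) in all three
    )

e_i, e_j, e_k = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}
sizes = region_sizes(len(e_i), len(e_j), len(e_k),
                     len(e_i & e_j), len(e_j & e_k), len(e_k & e_i),
                     triple_overlap(e_i, e_j, e_k))
```

Only `triple_overlap` touches individual nodes, and it scans the smallest of the three hyperedges, which is where the bound in the lemma comes from.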
Theorem 1 (Complexity of MoCHyE).
The time complexity of Algorithm 2 is $O(\sum_{e_i \in E} (|N_{e_i}|^2 \cdot |e_i|))$.
Proof.
4.3 Approximate Hmotif Counting
We present two different versions of MoCHyA (MoCHy Approximate), which approximately count the instances of each hmotif. The two versions estimate the counts by exploring the input hypergraph partially through hyperedge sampling and hyperwedge sampling, respectively, and thus they are particularly useful for large-scale hypergraphs. In addition, both versions yield unbiased estimates.
MoCHyA: Hyperedge Sampling (Algorithm 3):
MoCHyA, which is based on hyperedge sampling, is described in Algorithm 3. It repeatedly samples $s$ hyperedges from the hyperedge set $E$ uniformly at random with replacement. For each sampled hyperedge $e_i$, the algorithm searches for all hmotif instances that contain $e_i$, and to this end, the 1-hop and 2-hop neighbors of $e_i$ in the projected graph $\bar{G}$ are explored. After that, for each such instance of hmotif $t$, the corresponding count is incremented. Lastly, each estimate is rescaled by multiplying it by $\frac{|E|}{3s}$, which is the reciprocal of the expected number of times that each of the hmotif’s instances is counted. Note that each hyperedge is expected to be sampled $\frac{s}{|E|}$ times, and each hmotif instance is counted whenever any of its three hyperedges is sampled. This rescaling makes each estimate unbiased, as formalized in Theorem 2.
Theorem 2 (Bias and Variance of MoCHyA).
For every hmotif $t$, Algorithm 3 provides an unbiased estimate $\bar{M}[t]$ of the count $M[t]$ of its instances, i.e.,
$E[\bar{M}[t]] = M[t]$.   (4)
The variance of the estimate is
(5) 
where is the number of pairs of hmotif ’s instances that share hyperedges.
Proof.
See Appendix A. ∎
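A minimal sketch of hyperedge sampling with the rescaling described above is given below. It is illustrative only: hmotifs are keyed by a canonical seven-region pattern rather than by the numbering of Figure 4, duplicated hyperedges are assumed to be removed, and the optional `samples` argument (a hypothetical addition) lets a caller fix the sampled hyperedge indices for testing.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations, permutations

def canonical(e_i, e_j, e_k):
    """Seven-region pattern (True = non-empty), minimized over hyperedge orderings."""
    def pattern(a, b, c):
        return (bool(a - b - c), bool(b - c - a), bool(c - a - b),
                bool((a & b) - c), bool((b & c) - a), bool((c & a) - b),
                bool(a & b & c))
    return min(pattern(*p) for p in permutations((e_i, e_j, e_k)))

def neighbors_of(E):
    by_node = defaultdict(list)
    for idx, e in enumerate(E):
        for v in e:
            by_node[v].append(idx)
    nbrs = [set() for _ in E]
    for ids in by_node.values():
        for i, j in combinations(ids, 2):
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def estimate_by_hyperedge_sampling(hyperedges, s, seed=None, samples=None):
    E = [frozenset(e) for e in hyperedges]
    nbrs = neighbors_of(E)
    rng = random.Random(seed)
    if samples is None:
        samples = [rng.randrange(len(E)) for _ in range(s)]
    raw = Counter()
    for i in samples:
        # Every instance containing e_i has a neighbor e_j of e_i and a third
        # hyperedge adjacent to e_i or e_j (a 1-hop or 2-hop neighbor of e_i).
        triples = {frozenset((i, j, k))
                   for j in nbrs[i]
                   for k in (nbrs[i] | nbrs[j]) - {i, j}}
        for t in triples:
            raw[canonical(*(E[x] for x in t))] += 1
    scale = len(E) / (3 * len(samples))        # reciprocal of the expected hit count
    return {motif: c * scale for motif, c in raw.items()}

toy = [{1, 2}, {2, 3}, {1, 3}, {3, 4}]
full_pass = estimate_by_hyperedge_sampling(toy, s=4, samples=[0, 1, 2, 3])
```

Sampling every hyperedge exactly once counts each instance three times, so after rescaling the estimates coincide with the exact counts. With random samples they match only in expectation.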
The time complexity of MoCHyA is given in Theorem 3.
Theorem 3 (Complexity of MoCHyA).
The average time complexity of Algorithm 3 is .
Proof.
Assume all sets and maps are implemented using hash tables. For a sample hyperedge , computing for every takes time, and by Lemma 2, computing for all considered hmotif instances takes time. Thus, from , the time complexity for processing a sample is
which can be written as
From this, linearity of expectation, is sampled, and is adjacent to the sample, the average time complexity per sample hyperedge becomes . Hence, the total time complexity for processing samples is .∎
MoCHyA^{+}: Hyperwedge Sampling (Algorithm 4):
MoCHyA^{+}, which provides a better tradeoff between speed and accuracy than MoCHyA, is described in Algorithm 4. Different from MoCHyA, which samples hyperedges, MoCHyA^{+} is based on hyperwedge sampling. It selects $r$ hyperwedges uniformly at random with replacement, and for each sampled hyperwedge $\wedge_{ij}$, it searches for all hmotif instances that contain $\wedge_{ij}$. To this end, the hyperedges that are adjacent to $e_i$ or $e_j$ in the projected graph $\bar{G}$ are considered. For each such instance of hmotif $t$, the corresponding estimate is incremented. Lastly, each estimate is rescaled so that it unbiasedly estimates $M[t]$, as formalized in Theorem 4. To this end, each estimate is multiplied by the reciprocal of the expected number of times that each instance of hmotif $t$ is counted. Note that each instance of open and closed hmotifs contains 2 and 3 hyperwedges, respectively. Each instance of closed hmotifs is counted if one of the 3 hyperwedges in it is sampled, while that of open hmotifs is counted if one of the 2 hyperwedges in it is sampled. Thus, on average, each instance of open and closed hmotifs is counted $\frac{2r}{|\wedge|}$ and $\frac{3r}{|\wedge|}$ times, respectively.
Theorem 4 (Bias and Variance of MoCHyA^{+}).
For every hmotif $t$, Algorithm 4 provides an unbiased estimate $\hat{M}[t]$ of the count $M[t]$ of its instances, i.e.,
$E[\hat{M}[t]] = M[t]$.   (6)
For every closed motif , the variance of the estimate is
(7) 
where is the number of pairs of hmotif ’s instances that share hyperwedges. For every open motif , the variance is
(8) 
Proof.
See Appendix B. ∎
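Hyperwedge sampling can be sketched in the same style. Again this is an illustrative version under assumptions: hmotifs are keyed by a canonical seven-region pattern, duplicated hyperedges are assumed to be removed, and the optional `samples` argument is a hypothetical hook for fixing the sampled hyperwedges. The only structural difference from hyperedge sampling is that open and closed instances are rescaled with different factors.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations, permutations

def canonical(e_i, e_j, e_k):
    """Seven-region pattern (True = non-empty), minimized over hyperedge orderings."""
    def pattern(a, b, c):
        return (bool(a - b - c), bool(b - c - a), bool(c - a - b),
                bool((a & b) - c), bool((b & c) - a), bool((c & a) - b),
                bool(a & b & c))
    return min(pattern(*p) for p in permutations((e_i, e_j, e_k)))

def neighbors_of(E):
    by_node = defaultdict(list)
    for idx, e in enumerate(E):
        for v in e:
            by_node[v].append(idx)
    nbrs = [set() for _ in E]
    for ids in by_node.values():
        for i, j in combinations(ids, 2):
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def estimate_by_hyperwedge_sampling(hyperedges, r, seed=None, samples=None):
    E = [frozenset(e) for e in hyperedges]
    nbrs = neighbors_of(E)
    wedges = sorted((i, j) for i in range(len(E)) for j in nbrs[i] if i < j)
    rng = random.Random(seed)
    if samples is None:
        samples = [rng.choice(wedges) for _ in range(r)]
    raw = Counter()
    for i, j in samples:
        for k in (nbrs[i] | nbrs[j]) - {i, j}:
            closed = k in nbrs[i] and k in nbrs[j]
            raw[(canonical(E[i], E[j], E[k]), closed)] += 1
    estimates = Counter()
    for (motif, closed), c in raw.items():
        wedges_per_instance = 3 if closed else 2   # closed: 3 hyperwedges, open: 2
        estimates[motif] += c * len(wedges) / (wedges_per_instance * len(samples))
    return dict(estimates)

toy = [{1, 2}, {2, 3}, {1, 3}, {3, 4}]
all_wedges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
full_pass = estimate_by_hyperwedge_sampling(toy, r=5, samples=all_wedges)
```

Sampling every hyperwedge once counts closed instances three times and open instances twice, so the two rescaling factors recover the exact total of four instances in the toy hypergraph.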
The time complexity of MoCHyA^{+} is given in Theorem 5.
Theorem 5 (Complexity of MoCHyA^{+}).
The average time complexity of Algorithm 4 is .
Proof.
Assume all sets and maps are implemented using hash tables. For a sample hyperwedge , computing takes time, and by Lemma 2, computing for all considered hmotif instances takes time. Thus, from , the time complexity for processing a sample is which can be written as
From this, linearity of expectation, is included in the sample, and is included in the sample, the average time complexity per sample hyperwedge is . Hence, the total time complexity for processing samples is .∎
Comparison of MoCHyA and MoCHyA^{+}: Empirically, MoCHyA^{+} provides a better tradeoff between speed and accuracy than MoCHyA, as presented in Section 5.4. Below, we provide an analysis that supports this observation.
Assume that the numbers of samples in both algorithms are set so that . For each hmotif , since both estimates of MoCHyA and of MoCHyA^{+} are unbiased (see Eq. (4) and (6)), we only need to compare their variances. By Eq. (5), , and by Eq. (7) and Eq. (8), . By definition, , and thus . Moreover, in realworld hypergraphs, tends to be several orders of magnitude larger than the other terms (i.e., , , and ), and thus of MoCHyA tends to have much larger variance (and thus much larger estimation error) than of MoCHyA^{+}. Despite this fact, MoCHyA and MoCHyA^{+} have the same time complexity, which is (see Theorems 3 and 5). Hence, MoCHyA^{+} is expected to provide a better tradeoff between speed and accuracy than MoCHyA, as confirmed empirically in Section 5.4.
4.4 Parallel and Onthefly Implementations
We discuss parallelization of MoCHy and then onthefly computation of projected graphs.
Parallelization: All versions of MoCHy and hypergraph projection are easily parallelized. Specifically, we can parallelize hypergraph projection and MoCHyE by letting multiple threads process different hyperedges (in the outer loops of Algorithm 1 and Algorithm 2, respectively) independently in parallel. Similarly, we can parallelize MoCHyA and MoCHyA^{+} by letting multiple threads sample and process different hyperedges (in Algorithm 3) and hyperwedges (in Algorithm 4), respectively, independently in parallel. The estimated counts of the same hmotif obtained by different threads are summed up only once before they are returned as outputs. We present some empirical results in Section 5.4.
Hmotif Counting without Projected Graphs: If the input hypergraph $G$ is large, computing its projected graph $\bar{G}$ (Algorithm 1) is time and space consuming. Specifically, building $\bar{G}$ takes $O(\sum_{\wedge_{ij} \in \wedge} |e_i \cap e_j|)$ time (see Lemma 1) and requires space proportional to the number of hyperwedges, which often exceeds the space required for storing $G$. Thus, instead of precomputing $\bar{G}$ entirely, we can build it incrementally while memoizing partial results within a given memory budget. For example, in MoCHyA^{+} (Algorithm 4), we compute the neighborhood of a hyperedge $e_i$ in $\bar{G}$ (i.e., $N_{e_i}$) only if (1) a hyperwedge containing $e_i$ (e.g., $\wedge_{ij}$) is sampled and (2) its neighborhood is not memoized.
This incremental computation of $\bar{G}$ can be beneficial in terms of speed since it skips projecting the neighborhood of a hyperedge if no hyperwedge containing it is sampled. However, it can also be harmful if memoized results exceed the memory budget and some parts of $\bar{G}$ need to be rebuilt multiple times. Then, given a memory budget in bits, how should we prioritize hyperedges if all their neighborhoods cannot be memoized? According to our experiments, despite their large size, memoizing the neighborhoods of hyperedges with high degree in $\bar{G}$ makes MoCHyA^{+} faster than memoizing the neighborhoods of randomly chosen hyperedges or least recently used (LRU) hyperedges. In Section 5.4, we experimentally examine the effects of the memory budget on the speed of MoCHyA^{+}.
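One way to realize such on-the-fly projection with degree-based prioritization is sketched below. The class name, the eviction rule, and the unit of the budget (the total number of memoized neighbor entries rather than bits) are all assumptions made for illustration. The text only states that neighborhoods of high-degree hyperedges are the ones worth keeping.

```python
from collections import defaultdict

class LazyProjection:
    """Compute neighborhoods on demand and memoize them within a budget."""

    def __init__(self, hyperedges, budget):
        self.E = [frozenset(e) for e in hyperedges]
        self.budget = budget                    # max number of memoized neighbor entries
        self.by_node = defaultdict(list)
        for idx, e in enumerate(self.E):
            for v in e:
                self.by_node[v].append(idx)
        self.memo = {}
        self.used = 0

    def neighbors(self, i):
        if i in self.memo:
            return self.memo[i]
        nbrs = {j for v in self.E[i] for j in self.by_node[v] if j != i}
        self._maybe_memoize(i, nbrs)
        return nbrs

    def _maybe_memoize(self, i, nbrs):
        size = len(nbrs)
        # Candidates for eviction: memoized hyperedges of lower degree, lowest first.
        victims = sorted((j for j in self.memo if len(self.memo[j]) < size),
                         key=lambda j: len(self.memo[j]))
        freed, chosen = 0, []
        for j in victims:
            if self.used - freed + size <= self.budget:
                break
            freed += len(self.memo[j])
            chosen.append(j)
        if self.used - freed + size <= self.budget:
            for j in chosen:                    # evict only if the newcomer then fits
                self.used -= len(self.memo.pop(j))
            self.memo[i] = nbrs
            self.used += size

lazy = LazyProjection([{1, 2}, {2, 3}, {1, 3}, {3, 4}], budget=3)
first = lazy.neighbors(0)     # memoized (2 entries)
second = lazy.neighbors(1)    # higher degree: replaces the entry for hyperedge 0
third = lazy.neighbors(3)     # lower degree and no room: recomputed on each call
```

Results are identical regardless of the budget. Only the amount of recomputation changes, which is the tradeoff examined in Section 5.4.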
5 Experiments
Table 3: Statistics of the real-world hypergraphs. Counts given with ± are estimates.
Dataset  $|V|$  $|E|$  $|\wedge|$  # hmotifs
coauthDBLP  1,924,991  2,466,792  125M  26.3B ± 18M
coauthgeology  1,256,385  1,203,895  37.6M  6B ± 4.8M
coauthhistory  1,014,734  895,439  1.7M  83.2M
contactprimary  242  12,704  2.2M  617M
contacthigh  327  7,818  593K  69.7M
emailEnron  143  1,512  87.8K  9.6M
emailEU  998  25,027  8.3M  7B
tagsubuntu  3,029  147,222  564M  4.3T ± 1.5B
tagsmath  1,629  170,476  913M  9.2T ± 3.2B
threadsubuntu  125,602  166,999  21.6M  11.4B
threadsmath  176,445  595,749  647M  2.2T ± 883M
In this section, we present experiments designed to answer the following questions:


Q1. Comparison with Random: Does counting instances of different hmotifs reveal structural design principles of realworld hypergraphs distinguished from those of random hypergraphs?

Q2. Comparison across Domains: Do characteristic profiles capture local structural patterns of hypergraphs unique to each domain?

Q3. Performance of Counting Algorithms: How fast and accurate are the different versions of MoCHy? Does the advanced version outperform the basic ones?
5.1 Experimental Settings
Machines: We conducted all the experiments on a machine with an AMD Ryzen 9 3900X CPU and 128GB RAM.
Implementations: We implemented every version of MoCHy using C++ and OpenMP.
Datasets: We used the following 11 realworld hypergraphs from five different domains:


tags (tagsubuntu and tagsmath): A node represents a tag. A hyperedge represents the tags attached to a post.

threads (threadsubuntu and threadsmath): A node represents a user. A hyperedge groups all users participating in a thread.
These hypergraphs are made public by the authors of [10] (https://www.cs.cornell.edu/~arb/data/), and in Table 3 we provide some statistics of the hypergraphs after removing duplicated hyperedges. We used MoCHyE for the coauthhistory dataset, the threadsubuntu dataset, and all datasets from the contact and email domains. For the other datasets, we used MoCHyA^{+}, unless otherwise stated. For the tagsubuntu, tagsmath, and threadsmath datasets, which contain trillions of hmotif instances, we limited the memory budget to a fraction of the edges in the projected graphs, as described in Section 4.4, while we precomputed the projected graphs for the other datasets. We used a single thread unless otherwise stated. We computed CPs based on five hypergraphs randomized as described in Section 3.3.
5.2 Q1. Comparison with Random
To answer Q1, we analyze the counts of different hmotifs’ instances in real and random hypergraphs. In Table 2, we report the (approximated) count of each hmotif’s instances in each real hypergraph together with the corresponding count averaged over five random hypergraphs that are obtained as described in Section 3.3. We rank hmotifs by the counts of their instances and examine the difference between the ranks in real and corresponding random hypergraphs. As seen in the table, the count distributions in real hypergraphs are clearly distinguished from those of random hypergraphs.
Hmotifs in Random Hypergraphs: We first notice from Table 2 that instances of hmotifs and appear much more frequently in random hypergraphs than in real hypergraphs from all domains. For example, instances of hmotif appear only about thousand times in the tagsmath dataset, while they appeared about million times (about more often) in the corresponding randomized hypergraphs. Additionally, in the threadsmath dataset, instances of hmotif appear about thousand times, while they appear about billion times (about more often). Instances of hmotifs and consist of a hyperedge and its two disjoint subsets (see Figure 4).
Hmotifs in Coauthorship Hypergraphs: We observe that instances of hmotifs , and appear more frequently in all three hypergraphs from the coauthorship domain than in the corresponding random hypergraphs. For example, while there is no single instance of hmotif in the corresponding random hypergraphs, there are about thousand such instances in the coauthhistory dataset. While there are only about instances of hmotif in the corresponding random hypergraphs, there are about million such instances (about more instances) in the coauthhistory dataset. As seen in Figure 4, in instances of hmotifs , , and , a hyperedge is overlapped with the two other overlapped hyperedges in three different ways.
Hmotifs in Contact Hypergraphs: Instances of hmotifs , , and are noticeably more common in both contact datasets than in the corresponding random hypergraphs. As seen in Figure 4, in instances of hmotifs , and , hyperedges are tightly connected and nodes are mainly located in the intersections of all or some hyperedges.
Hmotifs in Email Hypergraphs: Both email datasets contain particularly many instances of hmotifs and , compared to the corresponding random hypergraphs. As seen in Figure 4, instances of hmotifs and consist of three hyperedges where most nodes are contained in one hyperedge.
Hmotifs in Tags Hypergraphs: In addition to instances of hmotif , which are common in most real hypergraphs, instances of hmotif , where all seven regions are not empty (see Figure 4), are particularly more frequent in both tags datasets than in the corresponding random hypergraphs.
Hmotifs in Threads Hypergraphs: Lastly, in both datasets from the threads domain, instances of hmotifs and are noticeably more frequent than expected from the corresponding random hypergraphs.
5.3 Q2. Comparison across Domains
To answer Q2, we compare the characteristic profiles (CPs) of the real-world hypergraphs. In Figure 6, we present the CPs (i.e., the significances of the hmotifs) of each hypergraph. As seen in the figure, hypergraphs from the same domain have similar CPs. Specifically, all three hypergraphs from the coauthorship domain share extremely similar CPs, even though the absolute counts of hmotifs in them differ by several orders of magnitude. Similarly, the CPs of both hypergraphs from the tags domain are extremely similar. However, the CPs of the hypergraphs from the coauthorship domain are clearly distinguished from those of the hypergraphs from the tags domain. While the CPs of the hypergraphs from the contact domain and the CPs of those from the email domain are similar for the most part, they are distinguished by the significance of hmotif 3. These observations confirm that CPs accurately capture local structural patterns of real-world hypergraphs.
To further verify the effectiveness of CPs based on hmotifs, we compare them with CPs based on network motifs. Specifically, we represent each hypergraph as a bipartite graph where and . That is, the two types of nodes in the transformed bipartite graph represent the nodes and hyperedges, respectively, in the original hypergraph , and each edge in indicates that the node belongs to the hyperedge in . Then, we compute the CPs based on the network motifs consisting of to nodes, using Motivo [14], a state-of-the-art approximate algorithm for network motif counting. Lastly, based on both CPs, we compute the similarity matrices (specifically, correlation coefficient matrices) of the real-world hypergraphs. As seen in Figure 7, the domains of the real-world hypergraphs are distinguished more clearly by the CPs based on hmotifs than by the CPs based on network motifs. Numerically, when the CPs based on hmotifs are used, the average correlation coefficient is within domains and across domains, and thus the gap is . However, when the CPs based on network motifs are used, the average correlation coefficient is within domains and across domains, and thus the gap is just . These results support the claim that hmotifs play a key role in capturing the local structural patterns of real-world hypergraphs.
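The two steps of this comparison, the bipartite transformation and the within-domain versus across-domain correlation gap, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the averages are assumed to be taken over pairs of distinct hypergraphs, and the correlation coefficient is assumed to be Pearson's.

```python
from itertools import combinations
from math import sqrt


def to_bipartite_edges(hyperedges):
    """Return (node, hyperedge_index) pairs: one bipartite edge per membership."""
    return [(v, i) for i, e in enumerate(hyperedges) for v in sorted(e)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def within_across_gap(cps, domains):
    """Average correlation within domains, across domains, and their gap.

    cps:     dict mapping a hypergraph name to its profile (list of floats)
    domains: dict mapping a hypergraph name to its domain label
    Assumes at least one within-domain pair and one across-domain pair.
    """
    within, across = [], []
    for a, b in combinations(sorted(cps), 2):
        r = pearson(cps[a], cps[b])
        (within if domains[a] == domains[b] else across).append(r)
    avg_within = sum(within) / len(within)
    avg_across = sum(across) / len(across)
    return avg_within, avg_across, avg_within - avg_across
```

A larger gap means that profiles from the same domain resemble each other more than profiles from different domains, which is the sense in which one kind of CP separates domains more clearly than the other.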
5.4 Q3. Performance of Counting Algorithms
To answer Q3, we test the speed and accuracy of all versions of MoCHy under various settings. To this end, we measure elapsed time and relative error defined as
for MoCHyA and MoCHyA^{+}, respectively.
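As a rough illustration of how such a relative error could be computed, the sketch below assumes that it is the sum of absolute differences between exact and estimated counts over all hmotifs, divided by the sum of the exact counts. Both this form and the name `relative_error` are assumptions made for the example.

```python
def relative_error(exact, estimate):
    """Sum of absolute count errors over all hmotifs, divided by the total exact count.

    exact, estimate: sequences of per-hmotif counts in the same order.
    The normalization by the total exact count is an assumption of this sketch.
    """
    if len(exact) != len(estimate):
        raise ValueError("count vectors must have the same length")
    total = sum(exact)
    if total == 0:
        raise ValueError("exact counts must not all be zero")
    return sum(abs(m - m_hat) for m, m_hat in zip(exact, estimate)) / total
```

Under this definition a single number summarizes the error over all hmotifs, and frequent hmotifs dominate it.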
Speed and Accuracy: In Figure 8, we report the elapsed time and relative error of all versions of MoCHy on the different datasets where MoCHyE terminates within a reasonable time. The numbers of samples in MoCHyA and MoCHyA^{+} are set to percent of the counts of hyperedges and hyperwedges, respectively. As seen in the figure, MoCHyA^{+} provides the best tradeoff between speed and accuracy. For example, in the threadsubuntu dataset, MoCHyA^{+} provides lower relative error than MoCHyA, consistent with our theoretical analysis (see the last paragraph of Section 4.3). Moreover, in the same dataset, MoCHyA^{+} is faster than MoCHyE with little sacrifice in accuracy.
Effects of the Sample Size on CPs: In Figure 9, we report the CPs obtained by MoCHyA^{+} with different numbers of hyperwedge samples on datasets. Even with smaller numbers of samples, the CPs are estimated nearly perfectly.
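To give some intuition for the hyperedge-sampling estimator, the following sketch uses a simplified setting. It pools all hmotifs into one count instead of classifying each instance, treats three hyperedges as connected when at least two of their pairs overlap, and rescales the counts seen from the sampled hyperedges by |E| / (3s), where s is the number of samples. The rescaling factor is an assumption, motivated by the fact that each instance contains three hyperedges, and all function names are hypothetical.

```python
from itertools import combinations


def is_instance(a, b, c):
    """Three hyperedges are connected if at least two of the three pairs overlap."""
    return (bool(a & b) + bool(b & c) + bool(a & c)) >= 2


def exact_count(hyperedges):
    """Number of connected triples of hyperedges (all hmotifs pooled)."""
    return sum(is_instance(a, b, c) for a, b, c in combinations(hyperedges, 3))


def count_containing(hyperedges, i):
    """Number of instances that contain the hyperedge with index i."""
    others = [e for j, e in enumerate(hyperedges) if j != i]
    return sum(is_instance(hyperedges[i], b, c) for b, c in combinations(others, 2))


def estimate(hyperedges, sampled_indices):
    """Rescale the counts seen from the sampled hyperedges by |E| / (3 s)."""
    s = len(sampled_indices)
    seen = sum(count_containing(hyperedges, i) for i in sampled_indices)
    return seen * len(hyperedges) / (3 * s)
```

Averaging `estimate(E, [i])` over every index `i` recovers `exact_count(E)`, because each instance is seen from exactly three hyperedges. This is the unbiasedness property in miniature, assuming hyperedges are sampled uniformly at random.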
Parallelization: In Figure 10, we measure the running times of MoCHyE and MoCHyA^{+} with different numbers of threads on the threadsubuntu dataset. As seen in the figure, both algorithms achieve significant speedups with multiple threads. Specifically, with threads, MoCHyE and MoCHyA^{+} () achieve speedups of and , respectively.
Effects of On-the-fly Computation on Speed: We analyze the effects of the on-the-fly computation of projected graphs (discussed in Section 4.4) on the speed of MoCHyA^{+} under different memory budgets for memoization. To this end, we use the threadsubuntu dataset, and we set the memory budgets so that up to of the edges in the projected graph can be memoized. The results are shown in Figure 11. As the memory budget increases, MoCHyA^{+} becomes faster because it avoids repeated computation. Thanks to our careful prioritization scheme based on degree, by memoizing only of the edges, MoCHyA^{+} achieves speedups of about .
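The idea of computing projected-graph neighborhoods on demand and memoizing only some of them can be sketched as below. This is an illustrative sketch, not the scheme of Section 4.4: the class name, the budget measured in adjacency entries, and the eviction rule (drop strictly lower-degree entries to make room for a higher-degree one) are all assumptions.

```python
import heapq
from collections import defaultdict


class ProjectedGraph:
    """Computes hyperedge neighborhoods on demand and memoizes some of them."""

    def __init__(self, hyperedges, budget):
        self.hyperedges = [frozenset(e) for e in hyperedges]
        self.budget = budget  # maximum number of memoized adjacency entries
        self.incident = defaultdict(list)  # node -> indices of hyperedges containing it
        for i, e in enumerate(self.hyperedges):
            for v in e:
                self.incident[v].append(i)
        self.memo = {}  # hyperedge index -> {neighbor index: overlap size}
        self.heap = []  # (degree, hyperedge index) for memoized entries
        self.used = 0   # adjacency entries currently memoized
        self.computations = 0  # neighborhoods computed from scratch

    def _compute(self, i):
        """Build the neighborhood of hyperedge i from the node incidence lists."""
        self.computations += 1
        overlap = defaultdict(int)
        for v in self.hyperedges[i]:
            for j in self.incident[v]:
                if j != i:
                    overlap[j] += 1
        return dict(overlap)

    def neighbors(self, i):
        """Return {neighbor: overlap size}, reusing a memoized result if present."""
        if i in self.memo:
            return self.memo[i]
        result = self._compute(i)
        self._maybe_memoize(i, result)
        return result

    def _maybe_memoize(self, i, result):
        degree = len(result)
        # Evict strictly lower-degree entries while the new one does not fit.
        while (self.used + degree > self.budget and self.heap
               and self.heap[0][0] < degree):
            d, j = heapq.heappop(self.heap)
            del self.memo[j]
            self.used -= d
        if self.used + degree <= self.budget:
            self.memo[i] = result
            heapq.heappush(self.heap, (degree, i))
            self.used += degree
```

Favoring high-degree hyperedges is a natural choice here because their neighborhoods are the most expensive to recompute and are also the ones most likely to be requested again.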
6 Conclusions
In this work, we introduce hypergraph motifs (hmotifs), and using them, we investigate the local structures of real-world hypergraphs from different domains. We summarize our contributions as follows:

Novel Concepts: We define 26 hmotifs, which describe connectivity patterns of connected hyperedges in a unique and exhaustive way, independently of the sizes of hyperedges (Figure 4).

Fast and Provable Algorithms: We propose parallel algorithms for (approximately) counting every hmotif’s instances, and we theoretically and empirically analyze their speed and accuracy. Both approximate algorithms yield unbiased estimates (Theorems 2 and 4), and the advanced one in particular is up to faster than the exact algorithm, with little sacrifice in accuracy (Figure 8).
Future research directions include (1) extending hmotifs to richer hypergraphs, such as heterogeneous or temporal hypergraphs, and (2) incorporating hmotifs into various tasks, such as hypergraph embedding, ranking, and clustering.
Appendix A Proof of Theorem 2
We let be a random variable indicating whether the th sampled hyperedge (in line 3 of Algorithm 3) is included in the th instance of hmotif or not. That is, if the hyperedge is included in the instance, and otherwise. We let be the number of times that hmotif ’s instances are counted while processing sampled hyperedges. That is,
(9)
Then, by lines 33 of Algorithm 3,
(10) 
Proof of the Bias of (Eq. (4)):
Proof.
Since each hmotif instance contains three hyperedges, the probability that each th sampled hyperedge is contained in each th instance of hmotif is
(11)
From linearity of expectation,
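A sketch of how the linearity-of-expectation step can be carried out is given below. The notation is assumed rather than taken from the text: $E$ is the set of hyperedges, $s$ is the number of sampled hyperedges, $M[t]$ is the exact number of instances of hmotif $t$, $X_{ij}[t]$ is the indicator variable described above, and $\bar{M}'[t]$ is the raw count before rescaling. Hyperedges are assumed to be sampled uniformly at random, so each indicator has expectation $3/|E|$.

```latex
% Assumed notation: E = hyperedge set, s = number of samples,
% M[t] = exact count of hmotif t, X_{ij}[t] = indicator variable,
% \bar{M}'[t] = raw count before rescaling.
\begin{align*}
\mathbb{E}\big[\bar{M}'[t]\big]
  &= \sum_{i=1}^{s} \sum_{j=1}^{M[t]} \mathbb{E}\big[X_{ij}[t]\big]
   = \sum_{i=1}^{s} \sum_{j=1}^{M[t]} \frac{3}{|E|}
   = \frac{3s}{|E|}\, M[t], \\
\mathbb{E}\Big[\tfrac{|E|}{3s}\,\bar{M}'[t]\Big]
  &= M[t].
\end{align*}
```

Under these assumptions, rescaling the raw count by $|E|/(3s)$ gives an estimate whose expectation equals the exact count, which is what unbiasedness requires.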