Efficient Sampling Algorithms for Approximate Temporal Motif Counting (Extended Version)

07/28/2020 ∙ by Jingjing Wang, et al. ∙ National University of Singapore ∙ Singapore Management University

A great variety of complex systems ranging from user interactions in communication networks to transactions in financial markets can be modeled as temporal graphs, which consist of a set of vertices and a series of timestamped and directed edges. Temporal motifs in temporal graphs are generalized from subgraph patterns in static graphs which take into account edge orderings and durations in addition to structures. Counting the number of occurrences of temporal motifs is a fundamental problem for temporal network analysis. However, existing methods either cannot support temporal motifs or suffer from performance issues. In this paper, we focus on approximate temporal motif counting via random sampling. We first propose a generic edge sampling (ES) algorithm for estimating the number of instances of any temporal motif. Furthermore, we devise an improved EWS algorithm that hybridizes edge sampling with wedge sampling for counting temporal motifs with 3 vertices and 3 edges. We provide comprehensive analyses of the theoretical bounds and complexities of our proposed algorithms. Finally, we conduct extensive experiments on several real-world datasets, and the results show that our ES and EWS algorithms have higher efficiency, better accuracy, and greater scalability than the state-of-the-art sampling method for temporal motif counting.

1 Introduction

Graphs are one of the most fundamental data structures that are widely used for modeling complex systems across diverse domains from bioinformatics [30], to neuroscience [38], to social sciences [5]. Modern graph datasets increasingly incorporate temporal information to describe the dynamics of relations over time. Such graphs are referred to as temporal graphs [11] and typically represented by a set of vertices and a sequence of timestamped and directed edges between vertices called temporal edges. For example, a communication network [49, 9, 45, 46, 47] can be denoted by a temporal graph where each person is a vertex and each message sent from one person to another is a temporal edge. Similarly, computer networks and financial transactions can also be modeled as temporal graphs. Due to the ubiquitousness of temporal graphs, they have attracted much attention [9, 27, 49, 20, 25, 6, 32, 7] recently.

One fundamental problem in temporal graphs with wide real-world applications such as network characterization [27], structure prediction [22], and fraud detection [18] is to count the number of occurrences of small (connected) subgraph patterns (i.e., motifs [24]). To capture the temporal dynamics in network analysis, the notion of motif [27, 16, 17, 22] in temporal graphs is more general than its counterpart in static graphs. It takes into account not only the subgraph structure (i.e., subgraph isomorphism [36, 8]) but also the temporal information including edge ordering and motif duration. As an illustrative example, the two motifs in Figure 1 are different temporal motifs: though they have exactly the same structure, they differ in the ordering of edges. Consequently, although there has been a considerable amount of work on subgraph counting in static graphs [41, 14, 42, 2, 29, 44, 15, 35, 34, 4], these methods cannot be used for temporal motif counting directly.

Generally, it is a challenging task to count temporal motifs. Firstly, the problem is at least as hard as subgraph counting in static graphs, whose time complexity increases exponentially with the number of edges in the query subgraph. Secondly, it becomes even more computationally difficult because the temporal information is considered. For example, counting the number of instances of $k$-stars is simple in static graphs; however, counting temporal $k$-stars is proven to be NP-hard [22] due to the combinatorial nature of edge ordering. Thirdly, temporal graphs are multigraphs, which may have multiple edges between the same two vertices at different timestamps. As a result, there may exist many different instances of a temporal motif within the same set of vertices, which makes counting even more challenging. There have been a few methods for exact temporal motif counting [27] or enumeration [23, 18]. However, they suffer from efficiency issues and often cannot scale well to massive temporal graphs with hundreds of millions of edges [22].

Figure 1: Examples for temporal motifs

In many scenarios, it is not necessary to count motifs exactly, and finding an approximate number is sufficient for practical use. A recent work [22] has proposed a sampling method for approximate temporal motif counting. It partitions a temporal graph into equal-time intervals, utilizes an exact algorithm [23] to count the number of motif instances in a subset of intervals, and computes an estimate from the per-interval counts. However, this method still cannot achieve satisfactory performance in massive datasets. On the one hand, it fails to provide an accurate estimate when the sampling rate and length of intervals are small. On the other hand, its efficiency does not significantly improve upon that of exact methods when the sampling rate and length of intervals are too large.

Our Contributions: In this paper, we propose more efficient and accurate sampling algorithms for approximate temporal motif counting. First of all, we propose a generic Edge Sampling (ES) algorithm to estimate the number of instances of any $k$-vertex $l$-edge temporal motif in a temporal graph. The basic idea of our ES algorithm is to first uniformly draw a set of random edges from the temporal graph, then exactly count the number of local motif instances that contain each sampled edge by enumerating them, and finally compute the global motif count from local counts. The ES algorithm exploits the BackTracking (BT) algorithm [36, 23] for subgraph isomorphism to enumerate local motif instances. We devise simple heuristics to determine the matching order of a motif for the BT algorithm to reduce the search space.

Furthermore, temporal motifs with 3 vertices and 3 edges (i.e., triadic patterns) are one of the most important classes of motifs, whose distribution is an indicator to characterize temporal networks [15, 27, 37, 5]. Therefore, we propose an improved Edge-Wedge Sampling (EWS) algorithm that combines edge sampling with wedge sampling [15, 35] specialized for counting any 3-vertex 3-edge temporal motif. Instead of enumerating all instances containing a sampled edge, the EWS algorithm generates a sample of temporal wedges (i.e., 3-vertex 2-edge motifs) and estimates the number of local instances by counting how many edges can match the query motif together with each sampled temporal wedge. In this way, EWS avoids the computationally intensive enumeration and greatly improves the efficiency upon ES. Moreover, we analyze the theoretical bounds and complexities of both ES and EWS.

Finally, we test our algorithms on several real-world datasets. The experimental results confirm the efficiency and effectiveness of our algorithms: ES and EWS can provide estimates with relative errors less than and in and seconds on a temporal graph with over M edges, respectively. In addition, they run up to and times faster than the state-of-the-art sampling method while having lower estimation errors.

Organization: The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the background and formulation of temporal motif counting. Section 4 presents the ES and EWS algorithms for temporal motif counting and analyzes them theoretically. Section 5 describes the setup and results of the experiments. Finally, Section 6 provides some concluding remarks.

2 Related Work

Random Sampling for Motif Counting: In recent years, there have been great efforts to (approximately) count the number of occurrences of a motif in a large graph via random sampling. First of all, many sampling methods such as subgraph sampling [33], edge sampling [1, 21, 43], color sampling [26], neighborhood sampling [28], wedge sampling [13, 15, 35, 34], and reservoir sampling [3], were proposed for approximate triangle counting (see [48] for an experimental analysis). Moreover, sampling methods were also used for estimating more complex motifs, e.g., 4-vertex motifs [14, 31], 5-vertex motifs [29, 41, 42, 44], motifs with 6 or more vertices [2], and $k$-cliques [12]. However, all of the above methods were proposed for static graphs and did not consider the temporal information and ordering of edges. Thus, they could not be applied to temporal motif counting directly.

Motifs in Temporal Networks: Prior studies have considered different types of temporal network motifs. Viard et al. [39, 40] and Himmel et al. [10] extended the notion of maximal clique to temporal networks and proposed efficient algorithms for maximal clique enumeration. Li et al. [20] proposed the notion of $(\theta, \tau)$-persistent $k$-core to capture the persistence of a community in temporal networks. However, these notions of temporal motifs were different from ours since they did not take edge ordering into account. Zhao et al. [49] and Gurukar et al. [9] studied the communication motifs, which are frequent subgraphs to characterize the patterns of information propagation in social networks. Kovanen et al. [17] and Kosyfaki et al. [16] defined the flow motifs to model flow transfer among a set of vertices within a time window in temporal networks. Although both definitions accounted for edge ordering, they were more restrictive than ours because the former assumed any two adjacent edges must occur within a fixed time span while the latter assumed edges in a motif must be consecutive events for a vertex [27].

Temporal Motif Counting & Enumeration: There have been several existing studies on counting and enumerating temporal motifs. Paranjape et al. [27] first formally defined the notion of temporal motifs we use in this paper. They proposed exact algorithms for counting temporal motifs based on subgraph enumeration in static graphs and timestamp-based pruning. Kumar and Calders [18] proposed an efficient algorithm called 2SCENT to enumerate all simple temporal cycles in a directed interaction network. Although 2SCENT was shown to be effective for cycles, it could not be used for enumerating temporal motifs of any other type. Mackey et al. [23] proposed an efficient BackTracking algorithm for temporal subgraph isomorphism. The algorithm could count temporal motifs exactly by enumerating all of them. Liu et al. [22] proposed an interval-based sampling framework for counting temporal motifs. To the best of our knowledge, this is the only existing work on approximate temporal motif counting via sampling. In this paper, we present two improved sampling algorithms for temporal motif counting and compare them with the algorithms in [27, 23, 18, 22] for evaluation.

3 Preliminaries

In this section, we formally define temporal graphs, temporal motifs, and the problem of temporal motif counting on a temporal graph. Here, we follow the definition of temporal motifs in [27, 22, 23] for its simplicity and generality. Other types of temporal motifs have been discussed in Section 2.

Temporal Graph: A temporal graph $T = (V_T, E_T)$ is defined by a set $V_T$ of $n$ vertices and a sequence $E_T$ of $m$ temporal edges among vertices in $V_T$. Each temporal edge $e = (u, v, t)$, where $u, v \in V_T$ and $t \in \mathbb{R}^+$, is a timestamped directed edge from $u$ to $v$ at time $t$. There may be many temporal edges from $u$ to $v$ at different timestamps (e.g., a user can comment on the posts of another user many times on Reddit). For ease of presentation, we assume the timestamp of each temporal edge is unique so that the temporal edges in $E_T$ are strictly ordered. Note that our algorithms can also handle the case when timestamps are non-unique by using any consistent rule to break ties.

Temporal Motif: We formalize the notion of temporal motifs [27, 22] in the following definition.

Definition 1 (Temporal Motif).

A temporal motif $M = (V_M, E_M, \sigma)$ consists of a (connected) graph with a set $V_M$ of $k$ vertices and a set $E_M$ of $l$ edges, and an ordering $\sigma$ on the edges in $E_M$.

Intuitively, a temporal motif $M$ can be represented as an ordered sequence of edges $\langle e'_1 = (u'_1, v'_1), \ldots, e'_l = (u'_l, v'_l) \rangle$. Given a temporal motif as a template pattern, we aim to count how many times this pattern appears in a temporal graph $T$. Furthermore, we only consider the instances where the pattern is formed within a short time span. For example, an instance formed in an hour is more interesting than one formed accidentally in one year on a communication network [49, 9, 27]. Therefore, given a temporal graph $T$ and a temporal motif $M$, our goal is to find a sequence $S$ of edges from $E_T$ such that (1) $S$ exactly matches (i.e., is isomorphic to) $M$, (2) $S$ is in the same order as specified by $\sigma$, and (3) all edges in $S$ occur within a time span of at most $\delta$. We call such an edge sequence $S$ a $\delta$-instance [27, 22] of $M$ and the difference $t_l - t_1$ between the timestamps of its last and first edges the duration $\Delta(S)$ of instance $S$. The formal definition is given in the following.

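Definition 2 (Motif $\delta$-instance).

A sequence of $l$ edges $S = \langle (w_1, x_1, t_1), \ldots, (w_l, x_l, t_l) \rangle$ ($t_1 < \ldots < t_l$) from a temporal graph $T$ is a $\delta$-instance of a temporal motif $M = \langle (u'_1, v'_1), \ldots, (u'_l, v'_l) \rangle$ if (1) there exists a bijection $f$ between the vertex sets of $S$ and $M$ such that $f(w_i) = u'_i$ and $f(x_i) = v'_i$ for $i = 1, \ldots, l$; and (2) the duration $\Delta(S)$ is at most $\delta$, i.e., $t_l - t_1 \le \delta$.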
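Definition 2 can be turned into a direct membership test. The following is a minimal Python sketch, not the authors' implementation; the tuple layouts `(source, target, timestamp)` for temporal edges and `(source, target)` for motif edges, as well as the function name `is_delta_instance`, are assumptions made for illustration.

```python
from typing import Hashable, List, Tuple

TemporalEdge = Tuple[Hashable, Hashable, float]  # (source, target, timestamp); assumed layout
MotifEdge = Tuple[Hashable, Hashable]            # (source, target), listed in motif order


def is_delta_instance(seq: List[TemporalEdge], motif: List[MotifEdge], delta: float) -> bool:
    """Return True if `seq` is a delta-instance of `motif` in the sense of Definition 2."""
    if not seq or len(seq) != len(motif):
        return False
    times = [t for _, _, t in seq]
    # Timestamps must be strictly increasing (edge ordering constraint).
    if any(a >= b for a, b in zip(times, times[1:])):
        return False
    # Duration constraint: last timestamp minus first timestamp is at most delta.
    if times[-1] - times[0] > delta:
        return False
    motif_to_graph, graph_to_motif = {}, {}
    for (w, x, _), (mu, mv) in zip(seq, motif):
        for g, m in ((w, mu), (x, mv)):
            # Keeping both maps consistent makes the vertex mapping a bijection.
            if motif_to_graph.setdefault(m, g) != g or graph_to_motif.setdefault(g, m) != m:
                return False
    return True
```

The two dictionaries enforce the bijection in both directions, so a sequence that reuses one graph vertex for two distinct motif vertices is rejected even when every individual edge lines up.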

Figure 2: Example for temporal graph and motifs
Example 1.

Figure 2(a) illustrates a temporal graph. Let us consider the problem of finding all $\delta$-instances of the temporal motif in Figure 2(b). Figure 2(c) shows the valid $\delta$-instances found: they match the motif in terms of both structure and edge ordering, and their durations are within $\delta$. In addition, we also give invalid instances, which are isomorphic to the motif but violate either the edge ordering or the duration constraint.

Temporal Motif Counting: According to the above notions, we present the temporal motif counting problem studied in this paper.

Definition 3 (Temporal Motif Counting).

For a temporal graph $T$, a temporal motif $M$, and a time span $\delta$, the temporal motif counting problem returns the number $C_M$ of $\delta$-instances of $M$ appearing in $T$.

The temporal motif counting problem has been proven to be NP-hard even for very simple motifs, e.g., $k$-stars [22], because the edge ordering is taken into account. According to previous results [22], although there is a simple polynomial algorithm to count the number of $k$-stars on a static graph, it is NP-hard to exactly count the number of temporal $k$-stars. Typically, counting temporal motifs exactly on massive graphs with millions or even billions of edges is a computationally intensive task [27, 22]. Therefore, we focus on designing efficient and scalable sampling algorithms for estimating the number of temporal motifs approximately in Section 4. The frequently used notations are summarized in Table 1.

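Symbol                 Description
$T$                    Temporal graph
$V_T, E_T$             Set of vertices and edges in $T$
$n, m$                 Number of vertices and edges in $T$
$M$                    Temporal motif
$V_M, E_M$             Set of vertices and edges in $M$
$k, l$                 Number of vertices and edges in $M$
$\delta$               Maximum time span of a motif instance
$S$                    Motif $\delta$-instance
$C_M$                  Number of $\delta$-instances of $M$ in $T$
$\widehat{C}_M$        Unbiased estimator of $C_M$
$p$                    Probability of edge sampling
$\widehat{E}_T$        Set of sampled edges from $E_T$
$\eta(e)$              Number of $\delta$-instances of $M$ containing edge $e$
$\eta_j(e)$            Number of $\delta$-instances of $M$ when $e$ is mapped to $e'_j$
$q$                    Probability of wedge sampling
$W$                    Temporal wedge
$\eta(W)$              Number of $\delta$-instances of $M$ containing $W$
$W_j$                  Temporal wedge pattern for $M$ when $e$ is mapped to $e'_j$
$\widehat{W}_j(e)$     Set of sampled $\delta$-instances of $W_j$
$\widehat{\eta}_j(e)$  Unbiased estimator of $\eta_j(e)$
Table 1: Frequently used notations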

4 Our Algorithms

In this section, we present our proposed algorithms for approximate temporal motif counting in detail. We first describe our generic Edge Sampling (ES) algorithm in Section 4.1. Then, we introduce our improved EWS algorithm specific to counting 3-vertex 3-edge temporal motifs in Section 4.2. In addition, we theoretically analyze the expected values and variances of the estimates returned by both algorithms. Finally, we discuss the streaming implementation of our algorithms in Section 4.3.

4.1 The Generic Edge Sampling Algorithm

The Edge Sampling (ES) algorithm is motivated by an exact subgraph counting algorithm called edge iterator [48]. Given a temporal graph $T$, a temporal motif $M$, and a time span $\delta$, we use $\eta(e)$ to denote the number of local $\delta$-instances of $M$ containing an edge $e$. To count all $\delta$-instances of $M$ in $T$ exactly, we can simply count $\eta(e)$ for each $e \in E_T$ and then sum them up. In this way, each instance is counted $l$ times and the total number of instances is equal to the sum divided by $l$, i.e., $C_M = \frac{1}{l} \sum_{e \in E_T} \eta(e)$.

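Based on the above idea, we propose the ES algorithm for estimating $C_M$: For each edge $e \in E_T$, we randomly sample it and compute $\eta(e)$ with fixed probability $p$. Then, we acquire an unbiased estimator $\widehat{C}_M$ of $C_M$ by adding up $\eta(e)$ for each sampled edge and scaling the sum by a factor of $\frac{1}{pl}$, i.e., $\widehat{C}_M = \frac{1}{pl} \sum_{e \in \widehat{E}_T} \eta(e)$, where $\widehat{E}_T$ is the set of sampled edges.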

Now the remaining problem becomes how to compute $\eta(e)$ for an edge $e$. The ES algorithm adopts the well-known BackTracking algorithm [36, 23] to enumerate all $\delta$-instances that contain an edge $e$ for computing $\eta(e)$. Specifically, the BackTracking algorithm runs $l$ times for each edge $e$; in the $j$-th run, it first maps edge $e$ to the $j$-th edge $e'_j$ of $M$ and then uses a tree search to find all different combinations of the remaining $l - 1$ edges that can form $\delta$-instances of $M$ with edge $e$. Let $\eta_j(e)$ be the number of $\delta$-instances of $M$ where $e$ is mapped to $e'_j$. It is obvious that $\eta(e)$ is equal to the sum of $\eta_j(e)$ for $j = 1, \ldots, l$, i.e., $\eta(e) = \sum_{j=1}^{l} \eta_j(e)$.

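Input: Temporal graph $T$, temporal motif $M$, time span $\delta$, edge sampling probability $p$.
Output: Estimator $\widehat{C}_M$ of the number of $\delta$-instances of $M$ in $T$
Initialize $\widehat{E}_T \leftarrow \emptyset$;
foreach $e \in E_T$ do
    Toss a biased coin with success probability $p$;
    if success then
        $\widehat{E}_T \leftarrow \widehat{E}_T \cup \{e\}$;
foreach $e = (u, v, t) \in \widehat{E}_T$ do
    Set $\eta(e) \leftarrow 0$;
    for $j \leftarrow 1, \ldots, l$ do
        Generate an initial instance $S$ by mapping $e$ to $e'_j$;
        Run BackTracking on the edges with timestamps in $[t - \delta, t + \delta]$ starting from $S$ to find the set of all $S$ such that $S$ is a $\delta$-instance of $M$ where $e$ is mapped to $e'_j$;
        Set $\eta_j(e)$ to the size of this set and $\eta(e) \leftarrow \eta(e) + \eta_j(e)$;
return $\widehat{C}_M \leftarrow \frac{1}{pl} \sum_{e \in \widehat{E}_T} \eta(e)$;
Algorithm 1 Edge Sampling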

We depict the procedure of our ES algorithm in Algorithm 1. The first step of ES is to generate a random sample $\widehat{E}_T$ of edges from the edge set $E_T$, where each edge is added independently with probability $p$. Then, in the second step, it counts the number $\eta(e)$ of local $\delta$-instances of $M$ for each sampled edge $e$ by running the BackTracking algorithm to enumerate each instance $S$ that is a $\delta$-instance of $M$ and maps $e$ to $e'_j$ for $j = 1, \ldots, l$. Note that BackTracking (BT) runs on the subset of $E_T$ that consists of all edges with timestamps from $t - \delta$ to $t + \delta$ for edge $e = (u, v, t)$, since it is safe to ignore any other edge due to the duration constraint. Here, we omit the detailed procedure of the BT algorithm because it generally follows an existing algorithm for subgraph isomorphism in temporal graphs [23]. The main difference between our algorithm and the one in [23] lies in the matching order, which will be discussed later. After counting $\eta(e)$ for each sampled edge $e$, it finally returns an estimate $\widehat{C}_M$ of $C_M$.
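The two steps of edge sampling can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' C++ code: local counts are obtained by brute-force enumeration over the time window rather than by BackTracking, edges are `(source, target, timestamp)` tuples with unique timestamps, and the helper names `matches`, `local_count`, and `edge_sampling_estimate` are hypothetical.

```python
import random
from itertools import combinations


def matches(seq, motif, delta):
    """Check structure, bijection, and duration for a time-sorted edge sequence."""
    if seq[-1][2] - seq[0][2] > delta:
        return False
    fwd, bwd = {}, {}
    for (w, x, _), (mu, mv) in zip(seq, motif):
        for g, m in ((w, mu), (x, mv)):
            if fwd.setdefault(m, g) != g or bwd.setdefault(g, m) != m:
                return False
    return True


def local_count(edges, e, motif, delta):
    """Number of delta-instances containing edge e (brute force, stands in for BackTracking)."""
    t = e[2]
    # Only edges within [t - delta, t + delta] can share an instance with e.
    window = [g for g in edges if g != e and abs(g[2] - t) <= delta]
    count = 0
    for others in combinations(window, len(motif) - 1):
        seq = sorted(others + (e,), key=lambda g: g[2])
        if matches(seq, motif, delta):
            count += 1
    return count


def edge_sampling_estimate(edges, motif, delta, p, rng=None):
    """Sample each edge with probability p and scale the summed local counts by 1 / (p * l)."""
    if rng is None:
        rng = random.Random()
    sampled = [e for e in edges if rng.random() < p]
    total = sum(local_count(edges, e, motif, delta) for e in sampled)
    return total / (p * len(motif))
```

With `p = 1.0` every edge is sampled, each instance is counted once per edge it contains, and the function returns the exact count; smaller values of `p` trade accuracy for fewer local enumerations.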

Matching Order for BackTracking: Now we discuss how to determine the matching order of a temporal motif. The BT algorithm in [23] adopts a time-first matching order: it always matches the edges of $M$ in order of $\langle e'_1, \ldots, e'_l \rangle$. The advantage of this matching order is that it best exploits the temporal information for search space pruning. For a partial instance $S$ after $e'_j$ is mapped, the search space for mapping $e'_{j+1}$ is restricted to the edges with timestamps in $(t_j, t_1 + \delta]$, where $t_1$ and $t_j$ are the timestamps of the edges mapped to $e'_1$ and $e'_j$. However, the time-first matching order may not work well in the ES algorithm. First, it does not consider the connectivity of the matching order: If $e'_{j+1}$ is not connected with any prior edge, it has to be mapped to all edges in this time range, which may lead to a large number of redundant partial matchings. Second, the time-first order is violated by the initial mapping step of Algorithm 1 when $j > 1$, since it first maps $e$ to $e'_j$.

In order to overcome the above two drawbacks, we propose two heuristics to determine the matching order of a given motif for reducing the search space, and generate $l$ matching orders for $M$, in each of which $e'_j$ ($j = 1, \ldots, l$) is placed first: (1) enforcing connectivity: For each $i = 2, \ldots, l$, the $i$-th edge in the matching order must be adjacent to at least one prior edge that has been matched; (2) boundary edge first: If there are multiple unmatched edges that satisfy the connectivity constraint, the boundary edge (i.e., the first or last unmatched edge in the ordering of $M$) will be matched first. The first rule can avoid redundant partial matchings and the second rule can restrict the temporal range of tree search, both of which are effective for search space pruning.
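The two rules can be combined into a small greedy procedure. The sketch below is one possible reading of the heuristics, not the authors' code: motif edges are `(source, target)` pairs indexed from 0 in temporal order, the function name `matching_order` is hypothetical, and ties (e.g., when both boundary edges are connected) are broken by taking the earlier edge, which is an assumption of this sketch.

```python
def matching_order(motif, first):
    """Greedy matching order for `motif` that starts from the edge with index `first`.

    Rule 1 (enforcing connectivity): the next edge must share a vertex with a matched edge.
    Rule 2 (boundary edge first): prefer the first or last unmatched edge in temporal order.
    """
    matched = [first]
    unmatched = [i for i in range(len(motif)) if i != first]  # stays sorted by index
    while unmatched:
        seen = {v for i in matched for v in motif[i]}
        connected = [i for i in unmatched if seen & set(motif[i])]
        if not connected:
            raise ValueError("motif is not connected")
        boundary = [i for i in (unmatched[0], unmatched[-1]) if i in connected]
        # Tie-breaking by smallest index is an assumption of this sketch.
        nxt = boundary[0] if boundary else connected[0]
        matched.append(nxt)
        unmatched.remove(nxt)
    return matched
```

For a connected motif the candidate list is never empty, so the loop always terminates with a permutation of the edge indices in which every edge after the first is adjacent to an earlier one.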

Example 2.

We consider how to decide the matching orders of (i.e., -simple temporal cycle) in Figure 2(b). When is placed first, we can select or as the second edge according to the enforcing connectivity rule; and is selected according to the boundary edge first rule. Then, either or can be selected as the next edge since they both satisfy two rules. Therefore, either or is a valid matching order. Accordingly, , , and are valid matching orders when , , and in are placed first, respectively.

Figure 3: Example of enumerating -instances () of motif for edge in Figure 2 using Backtracking
Example 3.

In Figure 3, we show how to use Backtracking to enumerate -instances of () for in Figure 2. There are three tree search procedures in each of which is mapped to , , and , respectively. The condition of each mapping step is given in form of where and are the starting and ending vertices and is the range of timestamps. Here, “” means that it can be mapped to an arbitrary unmapped vertex. Moreover, we use ‘’ and ‘✕’ to denote a successful matching and a failed partial matching, respectively. We find three -instances of and thus . When we run ES with and , since the numbers of -instances containing and are respectively and , we can compute .

Theoretical Analysis: Next, we analyze the estimate $\widehat{C}_M$ returned by Algorithm 1 theoretically. We first prove that $\widehat{C}_M$ is an unbiased estimator of $C_M$ in Theorem 1. The variance of $\widehat{C}_M$ is given in Theorem 2.

Theorem 1.

The expected value $\mathbb{E}[\widehat{C}_M]$ of $\widehat{C}_M$ returned by Algorithm 1 is $C_M$.

Proof.

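Here, we consider the edges in $E_T$ to be indexed by $[1, m]$ and use an indicator $\omega_i$ to denote whether the $i$-th edge $e_i$ is sampled, i.e.,

$$\omega_i = \begin{cases} 1, & e_i \in \widehat{E}_T \\ 0, & \text{otherwise.} \end{cases}$$

Then, we have

$$\widehat{C}_M = \frac{1}{pl} \sum_{e \in \widehat{E}_T} \eta(e) = \frac{1}{pl} \sum_{i=1}^{m} \omega_i \cdot \eta(e_i). \tag{1}$$

Next, based on Equation 1 and the fact that $\mathbb{E}[\omega_i] = p$, we have

$$\mathbb{E}[\widehat{C}_M] = \frac{1}{pl} \sum_{i=1}^{m} \mathbb{E}[\omega_i] \cdot \eta(e_i) = \frac{1}{l} \sum_{i=1}^{m} \eta(e_i) = C_M$$

and conclude the proof. ∎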

Theorem 2.

The variance $\mathrm{Var}[\widehat{C}_M]$ of $\widehat{C}_M$ returned by Algorithm 1 is at most $\frac{1-p}{p} \cdot C_M^2$.

Proof.

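According to Equation 1, we have

$$\mathrm{Var}[\widehat{C}_M] = \mathrm{Var}\Big[\frac{1}{pl} \sum_{i=1}^{m} \omega_i \cdot \eta(e_i)\Big] = \frac{1}{p^2 l^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \eta(e_i) \cdot \eta(e_j) \cdot \mathrm{Cov}(\omega_i, \omega_j).$$

Because the indicators $\omega_i$ and $\omega_j$ are independent if $i \neq j$, we have $\mathrm{Cov}(\omega_i, \omega_j) = 0$ for any $i \neq j$. In addition, $\mathrm{Cov}(\omega_i, \omega_i) = \mathrm{Var}[\omega_i] = p - p^2$. Based on the above results, we have

$$\mathrm{Var}[\widehat{C}_M] = \frac{1-p}{p l^2} \sum_{i=1}^{m} \eta^2(e_i) \le \frac{1-p}{p l^2} \Big(\sum_{i=1}^{m} \eta(e_i)\Big)^2 = \frac{1-p}{p} \cdot C_M^2$$

and conclude the proof. ∎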

Finally, we can acquire Theorem 3 by applying Chebyshev’s inequality to the result of Theorem 2.

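Theorem 3.

For any $\varepsilon > 0$, $\Pr\big[|\widehat{C}_M - C_M| \ge \varepsilon \cdot C_M\big] \le \frac{1-p}{p \varepsilon^2}$.

Proof.

By applying Chebyshev’s inequality, we have $\Pr\big[|\widehat{C}_M - C_M| \ge \varepsilon \cdot C_M\big] \le \frac{\mathrm{Var}[\widehat{C}_M]}{\varepsilon^2 C_M^2}$ and thus prove the theorem by substituting $\mathrm{Var}[\widehat{C}_M]$ with its upper bound $\frac{1-p}{p} \cdot C_M^2$ according to Theorem 2. ∎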

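According to Theorem 3, we can say $\widehat{C}_M$ is an $(\varepsilon, \gamma)$-estimator of $C_M$ for parameters $\varepsilon, \gamma \in (0, 1)$, i.e., $\Pr\big[|\widehat{C}_M - C_M| < \varepsilon \cdot C_M\big] \ge 1 - \gamma$, when $p \ge \frac{1}{1 + \gamma \varepsilon^2}$.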
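To see what this guarantee asks for in practice, the sampling probability can be computed from the target error and failure probability. The sketch below assumes the variance bound has the form $\frac{1-p}{p} \cdot C_M^2$ (which is what independent Bernoulli sampling of edges yields), so that Chebyshev's inequality bounds the failure probability by $\frac{1-p}{p \varepsilon^2}$; both function names are hypothetical.

```python
def chebyshev_failure_bound(p: float, eps: float) -> float:
    """Upper bound on Pr[|estimate - C_M| >= eps * C_M], assuming Var <= (1 - p) / p * C_M^2."""
    return (1.0 - p) / (p * eps ** 2)


def edge_sampling_probability(eps: float, gamma: float) -> float:
    """Smallest p for which the Chebyshev bound above is at most gamma."""
    return 1.0 / (1.0 + gamma * eps ** 2)
```

Because Chebyshev's inequality is loose, the probability returned for small `eps` is close to 1; a worst-case guarantee of this kind is far more conservative than the sampling rates that tend to work empirically.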

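Time Complexity: We first analyze the time complexity of computing $\eta(e)$ for an edge $e = (u, v, t)$. For BackTracking, the search space of each matching step is at most the number of (in-/out-)edges within range $[t - \delta, t]$ or $[t, t + \delta]$ connected with a vertex $v$. Here, we use $d_\delta$ to denote the maximum number of (in-/out-)edges connected with one vertex within any $\delta$-length time interval. The time complexity of BackTracking is $O(d_\delta^{l-1})$ and thus the time complexity of computing $\eta(e)$ is $O(l \cdot d_\delta^{l-1})$. Therefore, ES provides an $(\varepsilon, \gamma)$-estimator of $C_M$ in $O\big(\frac{m \cdot l \cdot d_\delta^{l-1}}{1 + \gamma \varepsilon^2}\big)$ time.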

4.2 The Improved EWS Algorithm

Figure 4: Temporal wedges, temporal -stars, and temporal triangles

The ES algorithm in Section 4.1 is generic and able to count any connected temporal motif. Nevertheless, there are still opportunities to further reduce the computational overhead of ES when the query motif is limited to 3-vertex 3-edge temporal motifs (i.e., triadic patterns), which are one of the most important classes of motifs to characterize temporal networks [15, 27, 37, 5]. In this section, we propose an improved Edge-Wedge Sampling (EWS) algorithm that combines edge sampling with wedge sampling for counting 3-vertex 3-edge temporal motifs.

Wedge sampling [15, 35, 34, 48] is a widely used method for triangle counting. Its basic idea is to draw a sample of wedges (i.e., 3-vertex 2-edge subgraph patterns) uniformly from a graph and use the ratio of “closed wedges” (i.e., wedges that form a triangle in the graph) to estimate the number of triangles. However, traditional wedge-sampling methods are proposed for undirected static graphs and cannot be directly used on temporal graphs. First, they consider that all wedges are isomorphic and treat them equally. But there are four temporal wedge patterns with different edge directions and orderings as illustrated in Figure 4(a). Second, they are designed for simple graphs where one wedge can form at most one triangle. However, since temporal graphs are multigraphs and there may exist multiple edges between the same two vertices, one temporal wedge can participate in more than one instance of a temporal motif. Therefore, in the EWS algorithm, we extend wedge sampling for temporal motif counting by addressing both issues.

The detailed procedure of EWS is presented in Algorithm 2. First of all, it uses the same method as ES to sample a set $\widehat{E}_T$ of edges. For each sampled edge $e \in \widehat{E}_T$ and $j = 1, \ldots, l$, it also maps $e$ to $e'_j$ for computing $\eta_j(e)$, i.e., the number of $\delta$-instances of $M$ where $e$ is mapped to $e'_j$. But, instead of running BackTracking to compute $\eta_j(e)$ exactly, it utilizes temporal wedge sampling to estimate $\eta_j(e)$ approximately without full enumeration, which is divided into two subroutines as discussed later. At last, it acquires an estimate $\widehat{C}_M$ of $C_M$ from the estimates $\widehat{\eta}_j(e)$ using a similar method to ES.
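The core of the wedge-level estimate is a second Bernoulli sampling step with inverse-probability scaling. The following Python sketch isolates that step under simplifying assumptions: the per-wedge counts are handed in as a plain list (in the algorithm they would be computed on demand only for the sampled wedges), and the function name `estimate_local_count` is hypothetical.

```python
import random


def estimate_local_count(wedge_counts, q, rng=None):
    """Estimate the sum of `wedge_counts` by keeping each wedge with probability q.

    `wedge_counts[r]` stands for the number of motif instances completed by the r-th
    candidate wedge; scaling the kept total by 1 / q makes the estimate unbiased.
    """
    if rng is None:
        rng = random.Random()
    kept = [c for c in wedge_counts if rng.random() < q]
    return sum(kept) / q
```

Setting `q = 1.0` keeps every wedge and returns the exact local count; lowering `q` reduces the number of wedges whose counts must be evaluated at the cost of extra variance.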

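Input: Temporal graph $T$, temporal motif $M$, time span $\delta$, edge sampling probability $p$, wedge sampling probability $q$.
Output: Estimator $\widehat{C}_M$ of the number of $\delta$-instances of $M$ in $T$
Generate $\widehat{E}_T$ using the edge sampling step of Algorithm 1;
foreach $e = (u, v, t) \in \widehat{E}_T$ do
    for $j \leftarrow 1, \ldots, l$ do
        Map edge $e$ to $e'_j$;
        Initialize $\widehat{\eta}_j(e) \leftarrow 0$ and $\widehat{W}_j(e) \leftarrow \emptyset$;
        if $M$ is a temporal -star then
            Select $W_j$ including $e'_j$ centered at the center of $M$;
        else if $M$ is a temporal triangle then
            Select $W_j$ including $e'_j$ centered at the vertex mapped to the one with a lower degree in $u$ and $v$;
        $E_j(e) \leftarrow$ all edges that form $\delta$-instances of $W_j$ with $e$;
        foreach $g \in E_j(e)$ do
            Add a $\delta$-instance $W$ of $W_j$ comprising $e$ and $g$ to $\widehat{W}_j(e)$ with probability $q$;
        foreach $W \in \widehat{W}_j(e)$ do
            Let $\eta(W)$ be the number of edges that form $\delta$-instances of $M$ together with $W$;
            $\widehat{\eta}_j(e) \leftarrow \widehat{\eta}_j(e) + \frac{\eta(W)}{q}$;
return $\widehat{C}_M \leftarrow \frac{1}{pl} \sum_{e \in \widehat{E}_T} \sum_{j=1}^{l} \widehat{\eta}_j(e)$;
Algorithm 2 Edge-Wedge Sampling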

Sample Temporal Wedges: The first step of temporal wedge sampling is to determine which temporal wedge pattern is to be matched according to the query motif $M$ and the mapping from $e$ to $e'_j$. Specifically, we categorize 3-vertex 3-edge temporal motifs into two types, i.e., temporal -stars and temporal triangles as shown in Figure 4, based on whether they are closed. Interested readers may refer to [27] for a full list of all 3-vertex 3-edge temporal motifs. For a star or wedge pattern, the vertex connected with all edges is its center. Given that $e$ has been mapped to $e'_j$, EWS should find a temporal wedge pattern $W_j$ containing $e'_j$ from $M$ for sampling. Here, different strategies are adopted to determine $W_j$ for star and triangle motifs: If $M$ is a temporal -star, it must select the $W_j$ that contains $e'_j$ and has the same center as $M$; if $M$ is a temporal triangle, it may use the vertex mapped to either $u$ or $v$ as the center to generate a wedge pattern. In this case, the center of $W_j$ will be mapped to the vertex with a lower degree between $u$ and $v$ for search space reduction. After deciding $W_j$, it enumerates all edges that form a $\delta$-instance of $W_j$ together with $e$ as $E_j(e)$ from the adjacency list of the central vertex. By selecting each edge $g \in E_j(e)$ with probability $q$, it generates a sample $\widehat{W}_j(e)$ of $\delta$-instances of $W_j$.

Estimate $\eta_j(e)$: Now, it estimates $\eta_j(e)$ from the set $\widehat{W}_j(e)$ of sampled temporal wedges. For each $W \in \widehat{W}_j(e)$, it counts the number $\eta(W)$ of $\delta$-instances of $M$ that contain $W$. Specifically, after matching $W$ with $W_j$, it can determine the starting and ending vertices as well as the temporal range for mapping the third edge of $M$. For the fast computation of $\eta(W)$, EWS maintains a hash table on the edge set $E_T$ that uses an ordered pair $\langle u, v \rangle$ ($u, v \in V_T$) as the key and a sorted list of the timestamps of all edges from $u$ to $v$ as the value. In this way, $\eta(W)$ can be computed by a hash search followed by at most two binary searches on the sorted list. Finally, $\eta_j(e)$ can be estimated by summing up $\eta(W)$ for each $W \in \widehat{W}_j(e)$ and scaling the sum by $\frac{1}{q}$, i.e., $\widehat{\eta}_j(e) = \frac{1}{q} \sum_{W \in \widehat{W}_j(e)} \eta(W)$.
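The hash-table lookup with two binary searches can be sketched with Python's standard `bisect` module. This is an illustrative sketch, not the authors' data structure: edges are assumed to be `(source, target, timestamp)` tuples, the helper names `build_index` and `count_edges` are hypothetical, and the time range is treated as closed on both ends (the caller would tighten the bounds to respect strict edge ordering).

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(edges):
    """Map each ordered vertex pair (u, v) to the sorted timestamps of all edges u -> v."""
    index = defaultdict(list)
    for u, v, t in sorted(edges, key=lambda e: e[2]):
        index[(u, v)].append(t)
    return index


def count_edges(index, u, v, t_lo, t_hi):
    """Number of edges u -> v with timestamp in the closed range [t_lo, t_hi]."""
    times = index.get((u, v), [])
    # One hash lookup, then one binary search per end of the range.
    return bisect_right(times, t_hi) - bisect_left(times, t_lo)
```

Building the index costs one sort over the edge set; afterwards each query takes time logarithmic in the number of parallel edges between the two vertices, independent of the size of the graph.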

Example 4.

In Figure 5, we show how to compute using EWS on the temporal graph in Figure 2. In this example, is set to , i.e., all temporal wedges found are sampled. When is mapped to of in Figure 4, we have and find instances and of . Then, we acquire and and thus . For an edge mapped to of in Figure 4, is used as the central vertex since . Then, we have and there is only one instance of found. As , we get accordingly.

Theoretical Analysis: Next, we analyze the estimate $\widehat{C}_M$ returned by Algorithm 2 theoretically. We establish the unbiasedness and the variance of $\widehat{C}_M$ in Theorem 4 and Theorem 5, respectively. Detailed proofs are also provided in the technical report.

Theorem 4.

The expected value $\mathbb{E}[\widehat{C}_M]$ of $\widehat{C}_M$ returned by Algorithm 2 is $C_M$.

Proof.

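By applying the result of Theorem 1, we only need to show $\mathbb{E}[\widehat{\eta}_j(e)] = \eta_j(e)$ to prove Theorem 4. Here, we index the edges in the candidate set $E_j(e)$ (i.e., the edges that form $\delta$-instances of $W_j$ together with $e$) by $[1, |E_j(e)|]$ and use an indicator $\omega_r$ to denote whether the wedge $W_r$ w.r.t. the $r$-th edge in $E_j(e)$ is sampled. We have the following equality:

$$\mathbb{E}[\widehat{\eta}_j(e)] = \frac{1}{q} \sum_{r=1}^{|E_j(e)|} \mathbb{E}[\omega_r] \cdot \eta(W_r) = \sum_{r=1}^{|E_j(e)|} \eta(W_r) = \eta_j(e)$$

and conclude the proof. ∎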

Figure 5: Examples for the EWS algorithm
Theorem 5.

The variance $\mathrm{Var}[\widehat{C}_M]$ of $\widehat{C}_M$ returned by Algorithm 2 is at most $\frac{1-pq}{pq} \cdot C_M^2$.

Proof.

Let us index the edges in by and the edges in when is mapped to by where . Similar to the proof of Theorem 2, we have

where is the number of -instances of containing a temporal wedge and is its indicator, the second equality holds for the independence of and , the third equality holds because , and the last inequality holds for . ∎

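According to the result of Theorem 5 and Chebyshev’s inequality, we have $\Pr\big[|\widehat{C}_M - C_M| \ge \varepsilon \cdot C_M\big] \le \frac{1-pq}{pq \varepsilon^2}$ and $\widehat{C}_M$ is an $(\varepsilon, \gamma)$-estimator of $C_M$ for parameters $\varepsilon, \gamma \in (0, 1)$ when $pq \ge \frac{1}{1 + \gamma \varepsilon^2}$.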

Time Complexity: We first analyze the time to compute . First, is bounded by the maximum number of (in-/out-)edges connected with one vertex within any -length time interval, i.e., . Second, the time to compute using a hash table is where is the maximum number of edges between any two vertices. Therefore, the time complexity per edge in EWS is . This is lower than time per edge in ES (). Finally, EWS provides an -estimator of in time.

4.3 Streaming Implementation

To deal with a dataset that is too large to fit in memory or generated in a streaming manner, it is possible to adapt our algorithms to a streaming setting. Assuming that all edges are sorted in chronological order, our algorithms can determine whether to sample an edge or not when it arrives. Then, for each sampled edge $e = (u, v, t)$, we only need the edges with timestamps in $[t - \delta, t + \delta]$ to compute its local count $\eta(e)$ or its estimate. After a one-pass scan over the temporal graph stream, we can obtain an estimate of the number of instances of a temporal motif in the stream. Generally, our algorithms can process any temporal graph stream in one pass by always maintaining the edges in the most recent time interval of length $\delta$ while having the same theoretical bounds as in the batch setting.
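A one-pass adaptation can be sketched with a bounded buffer. The sketch below is an assumption-laden illustration rather than the authors' streaming implementation: it conservatively buffers edges until every pending sampled edge has seen its full neighbourhood of $\pm\delta$, so the buffer may span up to roughly twice the time span; the generator name `stream_samples` and the `(source, target, timestamp)` tuple layout are hypothetical.

```python
import random
from collections import deque


def stream_samples(stream, delta, p, rng=None):
    """Yield (sampled_edge, neighbourhood) pairs in one pass over a time-sorted edge stream."""
    if rng is None:
        rng = random.Random()
    buffer = deque()   # recent edges that some sampled edge may still need
    pending = deque()  # sampled edges waiting for edges up to t + delta

    def neighbourhood(e):
        return [g for g in buffer if abs(g[2] - e[2]) <= delta]

    for edge in stream:
        now = edge[2]
        # A sampled edge is complete once the stream has moved past t + delta.
        while pending and pending[0][2] + delta < now:
            e = pending.popleft()
            yield e, neighbourhood(e)
        buffer.append(edge)
        if rng.random() < p:  # sampling decision is made on arrival
            pending.append(edge)
        # Evict edges older than what the oldest pending (or any future) sample can need.
        horizon = (pending[0][2] if pending else now) - delta
        while buffer and buffer[0][2] < horizon:
            buffer.popleft()
    while pending:  # flush samples whose window reaches the end of the stream
        e = pending.popleft()
        yield e, neighbourhood(e)
```

Each yielded neighbourhood contains exactly the edges whose timestamps lie within `delta` of the sampled edge, so any local counting routine can be applied to it without access to the rest of the stream.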

5 Experimental Evaluation

In this section, we evaluate the empirical performance of our proposed algorithms on real-world datasets. We first introduce the experimental setup in Section 5.1. The experimental results are presented in Section 5.2.

5.1 Experimental Setup

Experimental Environment: All experiments were conducted on a server running Ubuntu 18.04.1 LTS with an Intel® Xeon® Gold 6140 2.30GHz processor and 250GB main memory. All datasets and our code are publicly available (https://github.com/jingjing-hnu/Temporal-Motif-Counting). We downloaded the code of the baselines published by the authors (http://snap.stanford.edu/temporal-motifs/, https://github.com/rohit13k/CycleDetection, https://gitlab.com/paul.liu.ubc/sampling-temporal-motifs) and followed the instructions for compilation and usage. All algorithms were implemented in C++11, compiled by GCC v7.4 with -O3 optimizations, and run on a single thread.

Dataset Time span
AU years
SU years
SO years
BC years
RC years
Table 2: Statistics of datasets

Datasets: We used five different real-world datasets in our experiments: AskUbuntu (AU), SuperUser (SU), StackOverflow (SO), BitCoin (BC), and RedditComments (RC). All datasets were downloaded from publicly available sources such as the SNAP repository [19]. Each dataset is a sequence of temporal edges in chronological order. Table 2 reports, for each dataset, the number of vertices, the number of (static) edges, the number of temporal edges, and the overall time span of the entire dataset.

Algorithms: The algorithms compared are listed as follows.

  • EX: An exact algorithm for temporal motif counting in [27]. The available implementation is applicable only to 3-edge motifs and cannot support motifs with 4 or more edges (e.g., Q5 in Figure 6).

  • 2SCENT: An algorithm for simple temporal cycle (e.g., Q4 and Q5 in Figure 6) enumeration in [18].

  • BT: A BackTracking algorithm for temporal subgraph isomorphism in [23]. It provides the exact count of any temporal motif by enumerating all of them.

  • IS-BT: An interval-based sampling algorithm for temporal motif counting in [22]. BT [23] is used as a subroutine for any motif with more than vertices.

  • ES: Our generic edge sampling algorithm for temporal motif counting in Section 4.1.

  • EWS: Our improved algorithm that combines edge sampling with wedge sampling for counting temporal motifs with 3 vertices and 3 edges (e.g., Q1–Q4 in Figure 6) in Section 4.2.

Figure 6: Query motifs

Queries: The five query motifs we use in the experiments are listed in Figure 6. Since different algorithms specialize in different types of motifs, we select a few motifs that can best represent the specializations of all algorithms. As discussed above, an algorithm may not be applicable to some of the motifs. In this case, the algorithm is ignored in the experiments on these motifs.

Performance Measures: The efficiency is measured by the CPU time (in seconds) of an algorithm to count a query motif in a temporal graph. The accuracy of a sampling algorithm is measured by the relative error between the exact number of instances of a query motif in a temporal graph and the estimate of this number returned by the algorithm. In each experiment, we run all algorithms multiple times and use the average CPU time and relative errors for comparison.
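As a small illustration of the accuracy measure, the sketch below computes the relative error of a single estimate and its average over repeated runs. The exact formula is not preserved in this text, so the standard definition |estimate − exact| / exact is assumed here, and the function names are hypothetical.

```python
def relative_error(estimate, exact):
    """Relative error of one estimate, assuming the standard definition."""
    if exact <= 0:
        raise ValueError("exact count must be positive")
    return abs(estimate - exact) / exact


def average_relative_error(estimates, exact):
    """Mean relative error over the estimates from repeated runs."""
    if not estimates:
        raise ValueError("at least one estimate is required")
    return sum(relative_error(x, exact) for x in estimates) / len(estimates)


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results from the paper.
    print(relative_error(95, 100))
    print(average_relative_error([90, 110, 100], 100))
```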

5.2 Experimental Results

Dataset Motif EX 2SCENT BT IS-BT ES EWS
time (s) time (s) time (s) error time (s) error time (s) error time (s)
AU Q1 1.8 0.758 4.84% 0.402/1.9x 4.32% 0.059/12.8x 4.32% 0.027/28.1x
Q2 1.104 4.16% 0.434/2.5x 4.57% 0.048/23.0x 4.57% 0.029/38.1x
Q3 2.3 0.884 3.97% 0.50/1.8x *3.73% *0.605/1.5x *3.73% *0.183/4.8x
Q4 23.68 1.038 4.67% 0.492/2.1x *4.63% *0.628/1.7x *4.63% *0.173/6x
Q5 1.262 3.98% 0.536/2.4x *4.62% *0.322/3.9x
SU Q1 3.26 1.499 3.99% 0.620/2.4x 3.06% 0.102/14.7x 3.06% 0.052/28.8x
Q2 1.650 3.23% 0.671/2.5x 2.47% 0.083/19.9x 2.47% 0.046/35.9x
Q3 4.6 1.506 4.85% 0.723/2.1x 4.66% 0.113/13.3x 4.66% 0.030/50.2x
Q4 46.0 1.434 3.79% 0.725/2.0x 4.63% 0.128/11.2x 4.63% 0.042/34.1x
Q5 1.521 4.55% 0.759/2.0x *4.52% *0.453/3.4x
SO Q1 169 105.8 4.82% 8.626/12.3x 0.97% 4.419/23.9x 1.22% 1.528/69.2x
Q2 110.7 4.82% 27.48/4.0x 0.20% 3.985/27.8x 0.89% 1.514/73.1x
Q3 466 107.4 4.30% 25.70/4.2x 1.36% 4.031/26.6x 3.6% 1.235/87x
Q4 243.7 105.5 4.90% 6.775/15.6x 1.78% 3.936/26.8x 3.31% 1.153/91.5x
Q5 91.83 4.91% 9.451/9.7x 3.48% 1.505/61.0x
BC Q1 8143 220.0 4.75% 50.02/4.4x 0.64% 59.12/3.7x 0.67% 9.463/23.2x
Q2 399.8 4.90% 125.1/3.2x 1.11% 34.74/11.5x 1.16% 8.126/49.2x
Q3 8116 396.8 3.89% 90.19/4.4x 1.49% 41.49/9.6x 3.02% 2.121/187x
Q4 473.7 473.4 4.93% 95.47/5.0x 0.83% 37.43/12.6x 1.91% 2.262/209x
Q5 596.4 4.83% 319.7/1.9x 2.92% 20.47/29.1x
RC Q1 2799 1966 4.76% 840.5/2.3x 3.27% 257.4/7.6x 3.36% 31.49/62.4x
Q2 2113 4.67% 428/4.9x 0.63% 120.6/17.5x 0.6% 30.57/69.1x
Q3 2069 4.61% 784.4/2.6x 2.42% 76.09/27.2x 2.27% 16.17/128x
Q4 2245 1897 4.86% 683/2.8x 3.47% 68.60/27.7x 4.57% 15.91/119x
Q5 1613 4.41% 706.6/2.3x *4.32% *120.3/13.4x
Table 3: Running time (in seconds) and average errors (%) of all algorithms on each dataset. We use “—” and “✕” to denote “motif not supported” and “running out of memory”, respectively. For IS-BT, ES, and EWS, we show their speedup ratios over BT for comparison. We use “*” to mark the results of ES and EWS obtained with a larger edge sampling probability instead of the default one.
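The speedup ratios in Table 3 can be checked by dividing the running time of BT by that of the sampling algorithm. The sketch below does this for the Q1 row on AU (BT: 0.758 s; IS-BT: 0.402 s; ES: 0.059 s; EWS: 0.027 s). The helper name `speedup_over_bt` and the rounding to one decimal place are assumptions for illustration.

```python
def speedup_over_bt(bt_seconds, algo_seconds):
    """Speedup of a sampling algorithm relative to BT, rounded to one decimal."""
    return round(bt_seconds / algo_seconds, 1)


if __name__ == "__main__":
    bt = 0.758  # BT running time for Q1 on AU, taken from Table 3
    for name, seconds in [("IS-BT", 0.402), ("ES", 0.059), ("EWS", 0.027)]:
        print(f"{name}: {speedup_over_bt(bt, seconds)}x")
```

The three ratios agree with the 1.9x, 12.8x, and 28.1x reported for this row.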
Figure 7: Comparison of the performance of EWS under two parameter settings. Panels: (a) Q2 on SU; (b) Q3 on SU; (c) Q2 on BC; (d) Q3 on BC.
Figure 8: Relative error (%) vs. running time (seconds) with varying sampling probability. Panels: Q1–Q5 on each of AU, SU, SO, BC, and RC.
Figure 9: Scalability tests with varying time span and number of temporal edges. Panels: (a) Q3 on BC with varying time span; (b) Q2 on RC with varying number of temporal edges.

The overall performance of each algorithm is reported in Table 3. Here, the time span is set to 86400 seconds (i.e., one day) on AU and SU, and 3600 seconds (i.e., one hour) on SO, BC, and RC (note that we use the same time span values across all experiments unless otherwise specified). For IS-BT, we report the results in the default setting as indicated in [22], i.e., we fix the interval length to and present the result for the smallest interval sampling probability that can guarantee the relative error is at most . For ES and EWS, we report the results for the default edge sampling probability; in a few cases when the numbers of motif instances are too small or their distribution is highly skewed among edges, we report the results for a larger edge sampling probability (marked with “*” in Table 3) because ES and EWS cannot provide accurate estimates with the default one. In addition, we set to on AU and SU, and on SO, BC, and RC for EWS.

First of all, the efficiencies of EX and 2SCENT are lower than those of the other algorithms. This is because they use an algorithm for subgraph isomorphism or cycle detection in static graphs for candidate generation without considering temporal information. As a result, a large number of redundant candidates are generated, which degrades performance. Second, on medium-sized datasets (i.e., AU and SU), ES runs faster than IS-BT in most cases, while their relative errors are close to each other. On large datasets (i.e., SO, BC, and RC), ES demonstrates both much higher efficiency (up to x speedup) and lower estimation errors ( vs. ) than IS-BT. Third, EWS runs x–x faster than ES due to its lower computational cost per edge. The relative errors of ES and EWS are the same on AU and SU because . When , EWS achieves further speedups at the expense of higher relative errors. A more detailed analysis of the effect of is provided in the following paragraph.

Effect of for EWS: In Figure 7, we compare the relative errors and running time of EWS for and when is fixed to . We observe different effects of on medium-sized (e.g., SU) and large (e.g., BC) datasets. On the SU dataset, the benefit of smaller is marginal: the running time decreases slightly but the errors become obviously higher. But on the BC dataset, by setting , EWS achieves x–x speedups without affecting the accuracy seriously. These results imply that temporal wedge sampling is more effective on larger datasets. Therefore, we set on AU and SU, and on SO, BC, and RC for EWS in the remaining experiments.

Accuracy vs. Efficiency: Figure 8 demonstrates the trade-offs between relative error and running time of three sampling algorithms, namely IS-BT, ES, and EWS. For IS-BT, we fix the interval length to and vary the interval sampling probability from to . For ES and EWS, we vary the edge sampling probability from to . First of all, ES and EWS consistently achieve better trade-offs between accuracy and efficiency than IS-BT in almost all experiments. Specifically, ES and EWS can run up to x and x faster than IS-BT when the relative errors are at the same level. Meanwhile, in the same elapsed time, ES and EWS are up to x and x more accurate than IS-BT, respectively. Furthermore, EWS can outperform ES in all datasets except SO because of lower computational overhead. But on the SO dataset, since the distribution of motif instances is highly skewed among edges and thus the temporal wedge sampling leads to large errors in estimation, the performance of EWS degrades significantly and is close to or even worse than that of ES. Nevertheless, the effectiveness of temporal wedge sampling for EWS can still be confirmed by the results on the BC and RC datasets.

Scalability: We evaluate the scalability of different algorithms by varying the time span and the dataset size (i.e., the number of temporal edges). In both experiments, we use the same parameter settings as used for the same motif on the same dataset in Table 3. We test the effect of the time span for Q3 on the BC dataset by varying it from h to h. As shown in Figure 9(a), the running time of all algorithms increases near-linearly w.r.t. the time span. BT runs out of memory when h. The relative errors of ES and EWS remain steady for different time spans, but the accuracy of IS-BT degrades seriously as the time span increases. This is owing to the increase in cross-interval instances and the skewness of instances among intervals. Meanwhile, ES and EWS run up to x and x faster than IS-BT, respectively, while always having smaller errors. The results for Q2 on the RC dataset with a varying number of temporal edges are presented in Figure 9(b). Here, we vary the number of temporal edges from M to near M by extracting the first temporal edges of the RC dataset. The running time of all algorithms grows near-linearly w.r.t. the number of temporal edges. The fluctuations in the relative errors of IS-BT indicate that it is sensitive to the skewness of instances among intervals. ES and EWS always significantly outperform IS-BT for different dataset sizes: they run much faster, have smaller relative errors, and provide more stable estimates than IS-BT.

6 Conclusion

In this paper, we studied the problem of approximately counting a temporal motif in a temporal graph via random sampling. We first proposed a generic Edge Sampling (ES) algorithm to estimate the number of instances of any temporal motif in a temporal graph. Furthermore, we improved the ES algorithm by combining edge sampling with wedge sampling and devised the EWS algorithm for counting 3-vertex 3-edge temporal motifs. We provided comprehensive theoretical analyses of the unbiasedness, variances, and complexities of our algorithms. Extensive experiments on several real-world temporal graphs demonstrated the accuracy, efficiency, and scalability of our algorithms. Specifically, ES and EWS ran up to x and x faster than the state-of-the-art sampling method while having lower estimation errors.

References

  • Ahmed et al. [2014] N. K. Ahmed, N. G. Duffield, J. Neville, and R. R. Kompella. Graph sample and hold: a framework for big-graph analytics. In KDD, pages 1446–1455, 2014.
  • Bressan et al. [2017] M. Bressan, F. Chierichetti, R. Kumar, S. Leucci, and A. Panconesi. Counting graphlets: Space vs time. In WSDM, pages 557–566, 2017.
  • De Stefani et al. [2016] L. De Stefani, A. Epasto, M. Riondato, and E. Upfal. TRIÈST: counting local and global triangles in fully-dynamic streams with fixed memory size. In KDD, pages 825–834, 2016.
  • Etemadi et al. [2016] R. Etemadi, J. Lu, and Y. H. Tsin. Efficient estimation of triangles in very large graphs. In CIKM, pages 1251–1260, 2016.
  • Faust [2010] K. Faust. A puzzle concerning triads in social networks: graph constraints and the triad census. Soc. Netw., 32(3):221–233, 2010.
  • Galimberti et al. [2018] E. Galimberti, A. Barrat, F. Bonchi, C. Cattuto, and F. Gullo. Mining (maximal) span-cores from temporal networks. In CIKM, pages 107–116, 2018.
  • Guo et al. [2017] W. Guo, Y. Li, M. Sha, and K. Tan. Parallel personalized pagerank on dynamic graphs. PVLDB, 11(1):93–106, 2017.
  • Guo et al. [2020] W. Guo, Y. Li, M. Sha, B. He, X. Xiao, and K. Tan. GPU-accelerated subgraph enumeration on partitioned graphs. In SIGMOD, pages 1067–1082, 2020.
  • Gurukar et al. [2015] S. Gurukar, S. Ranu, and B. Ravindran. COMMIT: a scalable approach to mining communication motifs from dynamic networks. In SIGMOD, pages 475–489, 2015.
  • Himmel et al. [2016] A. Himmel, H. Molter, R. Niedermeier, and M. Sorge. Enumerating maximal cliques in temporal graphs. In ASONAM, pages 337–344, 2016.
  • Holme and Saramäki [2012] P. Holme and J. Saramäki. Temporal networks. Phys. Rep., 519(3):97–125, 2012.
  • Jain and Seshadhri [2017] S. Jain and C. Seshadhri. A fast and provable method for estimating clique counts using Turán’s theorem. In WWW, pages 441–449, 2017.
  • Jha et al. [2013] M. Jha, C. Seshadhri, and A. Pinar. A space efficient streaming algorithm for triangle counting using the birthday paradox. In KDD, pages 589–597, 2013.
  • Jha et al. [2015] M. Jha, C. Seshadhri, and A. Pinar. Path sampling: A fast and provable method for estimating 4-vertex subgraph counts. In WWW, pages 495–505, 2015.
  • Kolda et al. [2013] T. G. Kolda, A. Pinar, and C. Seshadhri. Triadic measures on graphs: The power of wedge sampling. In SDM, pages 10–18, 2013.
  • Kosyfaki et al. [2019] C. Kosyfaki, N. Mamoulis, E. Pitoura, and P. Tsaparas. Flow motifs in interaction networks. In EDBT, pages 241–252, 2019.
  • Kovanen et al. [2011] L. Kovanen, M. Karsai, K. Kaski, J. Kertész, and J. Saramäki. Temporal motifs in time-dependent networks. J. Stat. Mech.: Theory Exp., 2011(11):P11005, 2011.
  • Kumar and Calders [2018] R. Kumar and T. Calders. 2SCENT: an efficient algorithm to enumerate all simple temporal cycles. PVLDB, 11(11):1441–1453, 2018.
  • Leskovec and Sosič [2016] J. Leskovec and R. Sosič. SNAP: a general-purpose network analysis and graph-mining library. TIST, 8(1):1:1–1:20, 2016.
  • Li et al. [2018] R. Li, J. Su, L. Qin, J. X. Yu, and Q. Dai. Persistent community search in temporal networks. In ICDE, pages 797–808, 2018.
  • Lim and Kang [2015] Y. Lim and U. Kang. MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams. In KDD, pages 685–694, 2015.
  • Liu et al. [2019] P. Liu, A. R. Benson, and M. Charikar. Sampling methods for counting temporal motifs. In WSDM, pages 294–302, 2019.
  • Mackey et al. [2018] P. Mackey, K. Porterfield, E. Fitzhenry, S. Choudhury, and G. Chin Jr. A chronological edge-driven approach to temporal subgraph isomorphism. In BigData, pages 3972–3979, 2018.
  • Milo et al. [2002] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
  • Namaki et al. [2017] M. H. Namaki, Y. Wu, Q. Song, P. Lin, and T. Ge. Discovering graph temporal association rules. In CIKM, pages 1697–1706, 2017.
  • Pagh and Tsourakakis [2012] R. Pagh and C. E. Tsourakakis. Colorful triangle counting and a mapreduce implementation. Inf. Process. Lett., 112(7):277–281, 2012.
  • Paranjape et al. [2017] A. Paranjape, A. R. Benson, and J. Leskovec. Motifs in temporal networks. In WSDM, pages 601–610, 2017.
  • Pavan et al. [2013] A. Pavan, K. Tangwongsan, S. Tirthapura, and K. Wu. Counting and sampling triangles from a graph stream. PVLDB, 6(14):1870–1881, 2013.
  • Pinar et al. [2017] A. Pinar, C. Seshadhri, and V. Vishal. ESCAPE: efficiently counting all 5-vertex subgraphs. In WWW, pages 1431–1440, 2017.
  • Przulj [2007] N. Przulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):177–183, 2007.
  • Sanei-Mehri et al. [2018] S. Sanei-Mehri, A. E. Sariyüce, and S. Tirthapura. Butterfly counting in bipartite networks. In KDD, pages 2150–2159, 2018.
  • Sha et al. [2017] M. Sha, Y. Li, B. He, and K. Tan. Accelerating dynamic graph analytics on GPUs. PVLDB, 11(1):107–120, 2017.
  • Tsourakakis et al. [2009] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. DOULION: counting triangles in massive graphs with a coin. In KDD, pages 837–846, 2009.
  • Turk and Türkoglu [2019] A. Turk and D. Türkoglu. Revisiting wedge sampling for triangle counting. In WWW, pages 1875–1885, 2019.
  • Türkoglu and Turk [2017] D. Türkoglu and A. Turk. Edge-based wedge sampling to estimate triangle counts in very large graphs. In ICDM, pages 455–464, 2017.
  • Ullmann [1976] J. R. Ullmann. An algorithm for subgraph isomorphism. J. ACM, 23(1):31–42, 1976.
  • Uzupyte and Wit [2020] R. Uzupyte and E. C. Wit. Test for triadic closure and triadic protection in temporal relational event data. Social Netw. Analys. Mining, 10, 2020.
  • Varshney et al. [2011] L. R. Varshney, B. L. Chen, E. Paniagua, D. H. Hall, and D. B. Chklovskii. Structural properties of the Caenorhabditis elegans neuronal network. PLoS Comput. Biol., 7(2):1–21, 2011.
  • Viard et al. [2015] J. Viard, M. Latapy, and C. Magnien. Revealing contact patterns among high-school students using maximal cliques in link streams. In ASONAM, pages 1517–1522, 2015.
  • Viard et al. [2016] T. Viard, M. Latapy, and C. Magnien. Computing maximal cliques in link streams. Theor. Comput. Sci., 609:245–252, 2016.
  • Wang et al. [2014] P. Wang, J. C. S. Lui, B. F. Ribeiro, D. Towsley, J. Zhao, and X. Guan. Efficiently estimating motif statistics of large networks. TKDD, 9(2):8:1–8:27, 2014.
  • Wang et al. [2016] P. Wang, J. C. S. Lui, D. F. Towsley, and J. Zhao. Minfer: A method of inferring motif statistics from sampled edges. In ICDE, pages 1050–1061, 2016.
  • Wang et al. [2017a] P. Wang, Y. Qi, Y. Sun, X. Zhang, J. Tao, and X. Guan. Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage. PVLDB, 11(2):162–175, 2017a.
  • Wang et al. [2018a] P. Wang, J. Zhao, X. Zhang, Z. Li, J. Cheng, J. C. S. Lui, D. Towsley, J. Tao, and X. Guan. MOSS-5: a fast method of approximating counts of 5-node graphlets in large graphs. IEEE Trans. Knowl. Data Eng., 30(1):73–86, 2018a.
  • Wang et al. [2017b] Y. Wang, Q. Fan, Y. Li, and K. Tan. Real-time influence maximization on dynamic social streams. PVLDB, 10(7):805–816, 2017b.
  • Wang et al. [2018b] Y. Wang, Y. Li, J. Fan, and K. Tan. Location-aware influence maximization over dynamic social streams. ACM Trans. Inf. Syst., 36(4):43:1–43:35, 2018b.
  • Wang et al. [2019] Y. Wang, Y. Li, and K. Tan. Semantic and influence aware k-representative queries over social streams. In EDBT, pages 181–192, 2019.
  • Wu et al. [2016] B. Wu, K. Yi, and Z. Li. Counting triangles in large graphs by random sampling. IEEE Trans. Knowl. Data Eng., 28(8):2013–2026, 2016.
  • Zhao et al. [2010] Q. Zhao, Y. Tian, Q. He, N. Oliver, R. Jin, and W. Lee. Communication motifs: a tool to characterize social communications. In CIKM, pages 1645–1648, 2010.