1 Introduction
A common and important task in graph (network) analysis is the frequency of subgraph analysis. This task has two versions: counting nonoverlapping subgraphs (Elhesha and Kahveci (2016)), and counting all subgraphs (Ribeiro et al. (2019)). Such counting is important for multiple theoretical and computational tasks, including among many others: Subgraph isomorphism (Ullmann (1976)), graph classification (Jin et al. (2010)), graph anomaly detection (Papadimitriou et al. (2010)), and clique counting (Bron and Kerbosch (1973)). Within subgraph analysis, a special focus has been given to the frequency of small connected subgraphs, with typically 34 vertices. Such subgraphs are often called graphlets (Ahmed et al. (2015)), or subgraph motifs (Kashtan et al. (2004)). In a directed graph, motifs are required to be connected only in the underlying undirected graph (e.g. , is a motif, although there is no path from to ). A large number of algorithms was proposed to count subgraphs. Some algorithms are specific to a certain type of subgraphs (e.g. cliques (Bron and Kerbosch (1973)), triads (Schank and Wagner (2005)) or stars (Gonen et al. (2011))), but most existing algorithms focus on counting motifs, either directed or undirected (For an excellent recent review see Ribeiro et al. (2019)). Existing methods to count motifs can be divided into four main groups: Full counting methods (e.g. among many others Itzhack et al. (2007); Wernicke and Rasche (2006)) and sampling methods (See Wang et al. (2017); Yang et al. (2018); Bressan et al. (2018) for a few recent examples among many others) for directed and undirected motifs. Beyond those, there are currently over 50 existing algorithms for motif counting. Both sampling and full enumeration methods provide accurate estimates of the total subgraph number for any given motif.
A related work also extensively studied especially in undirected graphs is the count of the subgraphs frequency around each vertex. This frequency is useful for many applications, including topologybased machine learning algorithms (
Naaman et al. (2019)).Formally, local subgraph counting is defined as the number of times a given graph exists around a given vertex. Such local subgraph counting (or local motifs in the case of graphs connected in the underlying undirected graph) has various applications, including among many others: vertex clustering (Yin et al. (2018)), the detection and characterization of dense communities (Tsourakakis (2015)), network characterization and classification (Pržulj (2007); Benami et al. (2019); Naaman et al. (2019)), and also applications to machine learning, where these motifs are used as the input to classification and anomaly detection tasks (Hayes et al. (2013); Akoglu et al. (2015)). Such local motif counts have also been for link prediction (Rossi et al. (2012); Naaman et al. (2019)).
The local subgraphs studied vary among applications. However, many of them focus on undirected connected subgraphs with a predefined number of vertices (further denoted here as undirected , where is the number of vertices). Recently, local subgraph counting algorithms have been extended to large graphs (Ahmed et al. (2016a); Elenberg et al. (2015); Hočevar and Demšar (2014); Melckenbeeck et al. (2016); Ahmed et al. (2016b); Elenberg et al. (2016); Pashanasangi and Seshadhri (2020); Melckenbeeck et al. (2018)Pashanasangi and Seshadhri (2020)). Note that the position of the vertex in the subgraph is usually not taken into consideration. One can define three main approaches to local motif counts in large graphs: decomposition, matrixbased methods and enumeration methods. The first two methods are limited to undirected subgraphs (even if the original graph is directed, one can count undirected subgraph in the undirected graph induced by ignoring the direction of edges in the original graph), while the last can be applied to both directed and undirected subgraphs. Specifically, for a given subgraph type and a graph :
Still, there is currently no efficient enough fully parallel enumeration approach to count directed localmotifs. Moreover, most methods focus on the detection of a single motif, instead of counting all possible motifs simultaneously.
We here provide a solution for efficient and fully parallelized counting of all local subgraphs, and propose an algorithm to count the number of each motif that contains each vertex or edge, named VDMC (Vertex specific Distributed Motif Counting). In the following sections, we first outline the algorithm and then prove the main claims underlying the algorithm. We then provide an analytic estimate of the frequency of motifs in Erdős Rényi graphs (Erdős and Rényi (1960)) and show that VDMC produces the expected motif frequency. We discuss the VDMC performance on large random and real world graphs in both CPU and GPU machines.
2 Novelty
VMDC uses multiple elements, most of which have already been used in different algorithms, but never combined to produce a highly efficient parallel vertexspecific subgraph counting algorithm:

Explicit counting. Explicitly counting the number of each subgraph type containing a given vertex, through a BFS on the undirected graph underlying the directed graph (i.e. the graph produced by ignoring the direction of edges).

Divide and conquer. Applying a divide and conquer approach, by (defacto) removing from the graph each vertex for which we computed all motifs containing it. Each vertex is assigned an index based on its degree. Vertices are ordered, from the highest degree to the lowest, with an arbitrary order between vertices of equal degree. In each subgraph, only motifs with indices higher than the root are counted. The computation of subgraphs is performed in parallel, based on the lowest index in the subgraph. This order makes the parallelization more efficient since low index roots contain more motifs than high index roots. These low index roots are removed from the graph at the beginning which prevents repassing on these ”heavy” roots.Such a balancing optimizes the parallelization efficacy.

Efficient data structure. VDMC uses the cacheaware efficient Compressed Sparse Row(CSR) format (Buluç et al. (2009)) that minimizes the memory cost to the number of edges and allows for cacheaware parallelization.

A set of efficient rules to count each subgraph once and only once, as will be further explained in lemmas bellow.

Ordering possible motifs based on the average distance in the BFS of all vertices. To count all motifs only once, VDMC counts all motifs that have a structure with a low average BFS depth before motifs in structures with a higher BFS depth.

Combining isomorphisms only once at the end of the counting process. During the counting equivalent motifs are counted separately. Only at the end of the enumeration, isomorphic motifs are combined.

GPU implementation to maximize parallelization.
3 Related Work
Beyond the vast number of motif counting algorithm, multiple parallel enumeration algorithms were developed. Those are typically based on either Mapreduce applications, GPU, Shared, or Distributed Memory (SM or DM). Those include (a noncomprehensive list, mainly focused on the GPU based applications): (Wang et al Wang et al. (2005), Schatz et al Schatz et al. (2008), MPRF Liu et al. (2009),Ribeiro and collaborators (multiple algorithms) Ribeiro et al. (2010b, a); Eddin and Ribeiro (2017), Elenberg et al Elenberg et al. (2016), Rossi et al Rossi and Zhou (2016), Ahmed et al Ahmed et al. (2015), Lin et al Lin et al. (2016) and Milinkovich at al Milinković et al. (2015)).
We follow Wang et al. (2005) in distributing vertices to get an equal work share, but instead of using a combination of high and low degree vertices for each processor to better balance the work, we here sort the vertices by their (undirected) degree. In the analysis of each vertex, we compute only the following index vertices. Thus, as the vertex index increase, so does their degree, but in parallel, they do not process vertices of lower index numbers. Our approach is most similar to the work of Lin et al (Lin et al. (2016)). The main extensions here are counting the number of motifs that contain each vertex, to allow for vertexspecific motif counts. Moreover, VDMC uses a highly efficient memory usage formalism to reduce the total memory cost and the access time to memory in both CPU and GPU based applications.
4 Notation and Detailed description of algorithm
4.1 Definitions
We propose an algorithm that counts all motifs once and only once in parallel computation. We first suggest the algorithm notation and describe the algorithm. We then prove that this algorithm counts each motif once and only once. We use the following notations for an unweighted directed graph , where are vertices and are edges, and arbitrary indexing of the vertices, where vertex is the vertex with index . Without loss of generality, we assume that ranges between 1 and .

We denote by the undirected graph underlying produced by ignoring the direction of the edge in .

A subgraph is a if it is composed of vertices, and these vertices are a connected subgraph of .

Each is fully determined by the vertices composing it and can be described as a BFS based tree in (since it is connected), starting from a root , which is one of the vertices. The BFS tree is denoted as a . We denote the depth of an edge as its minimal distance from the root , and the average depth of the motif as the average depth of the vertices composing it. For example, a , with the root connected to three vertices has an average depth of , since the root has a depth of 0, and all other vertices have a depth of 1.

A will be defined to be proper if the index of the root is smaller than all other indices in the .

We denote all with a root index of as .

To classify a
into a specific , we follow Itzhack et al. (2007), and use two sets of values. The first value is different for isomorphs of the same motif, and the second is the lowest value for all isomorphs of the same (See Figure 1 for example). The index of a is based on a bit array of adjacencies. The graph is induced by vertices in any given order, and the edges between them can be represented as a miniature adjacency matrix, with 1 in the position if an edge exists between and. This adjacency matrix is then represented as a binary vector by ordering it by rows, and removing the diagonal (since we assume a simple graph with no self edges). This binary vector is then translated to a base 10 number, which is the index of the motif.
4.2 The Algorithm
Using these notations, and the proof below (in Section 5), we propose the VDMC algorithm:

Sort all vertices in any arbitrary order.

Set the number of motifs of each type (including all different isomorphs) for each vertex to be 0.

Split all to parallelize the analysis.

For each vertex compute all proper and update the motif count for each vertex included in each , as follows.

For each such compute the depth of each vertex in the BFS, order vertices by their depth, and then by their index. (See Figure 2 for example).

Compute the adjacency matrix of this and compute the motif type it creates. For example in Figure 2: 1234, 1267, 1678 are 4motifs of depth 0.75, 1 and 1.5 respectively.

Increase the counter for each vertex in the for this motif type by 1.



Sum over all isomorphs of each motif (See further description below).
To further improve the algorithm efficacy, we use in the C++ kernel the highly efficient CSR format for sparse graphs. This format increases the code efficacy by leveraging the computer’s internal cache mechanism. The CSR is composed of two arrays:

A consecutive ordered list of all the ordered lists of each vertices neighbors (Neighbors), maintained as a single array.

The starting index of each vertex’s neighbor list in the Neighbor array (Indices).
To understand the CSR format, here is a simple example. Assume the following graph: ().

If the graph is directed, is: [0, 3, 3, 4, 6], and is: [1, 2, 3, 0, 1, 2].

If the graph is undirected, is: [0, 3, 5, 7, 10], and is: [1, 2, 3, 0, 3, 0, 3, 0, 1, 2].
The CSR object is designed around the principle of cacheawareness. When accessing the graph in the C++ code, the aim is to accelerate the computations by loading sections of the graph into the cache ahead of time for quick access. This comes into effect in the BFS, when we access the blocks of neighbor vertices in the Neighbors vector, and pulling the entire list of neighbors of a certain vertex into the cache.
During the motif enumeration a separate count is used for different isomorphs of the same motif. For example, the motif in figure 1 can be counted as 53 (110101) or as 30 (011110). Eventually, all isomorphs of the same motif are summed to the ones with the lowest index (either in realtime, or at the end of the analysis).
5 Proof that each motif can be counted once and only once in parallel
Lemma 1 If all and only proper are counted, and within each each is counted once and only once, then each is counted once and only once.
Proof are fully determined by the vertices composing them. For any set of vertices (), we can denote the minimal index as . Since these vertices are connected in , such a will be counted within each . However, only is proper, and thus the motif composed of will be counted in and only there. Since it will be counted there once and only once, each will be counted once and only once.
Conclusion. Counting all motifs can be performed by counting only all proper , and can be parallelized by counting in parallel all motifs in all proper .
Lemma 2 In , for a given , there are only 2 possible structures for the and for the there are only 4 possible , as detailed in figure 2.
Proof. This can be seen by enumerating all possibilities at the 1st level (2 for , and 3 for ), and then the resulting possible combinations in lower levels.
Remark. Each structure has a different average depth and can be defined by its depth.
Before introducing the next lemma, let us follow a given . Such a has an average depth. A vertex in a can only have vertices of depth 1 or 2 (beyond the root). A vertex in a can be of depth 1,2 (with two variations) or 3 (Figure 2). In the example in Figure 2, the set of vertices can be counted six times in a of depth 0.75 (, , , etc.) twice in a of depth 1 ( and ), and twice in a of depth 1.5 ( and ). To prevent, the double counting within the same level, one can simply determine that within the same level, the BFS should always follow the index number (i.e. not count ). To prevent all counts of the , one can simply determine that no edges in the BFS are allowed from a given depth to a lower or equal depth. Thus, if we do not count motifs that point within the same level from a high to a low count, and from a higher level to a lower or equal level, we will ensure that will be counted only once. This can be stated through the following lemma.
Lemma 3 If one does not count in a given structure with a direction from a higher depth to a lower or equal depth in the BFS, and among vertices of the same depth, vertices are only considered in order of their index, then no will be counted twice in the same .
Proof Given a root and other vertices . Each vertex is assigned a single depth, which is its minimal depth (i.e. if a vertex is both a first and second neighbor of the root, it is assigned the minimal depth, which is ). The two rules above induce a full order between all vertices in (depth and index). Each motif in is counted only following the order defined by the rules above, as such it will be counted only once.
Note Lemma 3 is meant to eliminate the case of counting twice in the motifs. This does not ensure that each motif is counted at least once, as will be further shown. However, we can explicitly state the conditions for not to be counted and address that.
Lemma 4 Any not counted by the rules above is of average depth (4 vertices in a row) and is part of a loop containing the uncounted motif and a single extra external vertex.
Proof Each is per definition contained in a proper . It will not be counted only if it is ignored following the conditions in Lemma 3. It cannot be ignored following two vertices of the same depth but in the wrong order, since the with the opposite order would be counted. Thus, the only condition that a is not counted would be an edge in the BFS from a higher to a lower or equal depth vertex. Let us denote these vertices (lower depth  closer to root) and (higher depth farther away from the root). cannot be of depth 0, per definition (since otherwise, the root would appear twice in the ). Thus, is at least of depth . Similarly, cannot be of depth more than 2, since any vertex at depth 3 is the last vertex in a and does not point to anything. and cannot be of depth 1 in the same as explained by the symmetry argument above. Thus, must be of depth 2 and must be of depth 1 or 2. if would have been of depth 1, the proposed be counted in the average depth 1 , leaving the last option of both and being of depth 2. They cannot be from the same (since they would be either with a common depth 1 ancestor, and then the symmetry argument above would hold, or from different ancestors, but then one would require at least 5 vertices in the . One is only left with the choice that is of depth 2 but through a vertex not in the current .
Comment It is beyond the scope of the current manuscript, but a not accounted will contain a circle with a vertex outside this specific BFS but with the same root. The length of such a loop in the case of a is exactly 5. Therefore we know that the problematic vertex was marked at both depths 2 and 3 exactly.
The counting algorithm takes this case into account and counts a motif even if the last node is marked of depth 2, as long as it is not from the same . This actually simplifies the algorithm, since, for the last vertex, one must not check if it has any neighbor of depth 1, but rather if he is directly connected to the edge at depth in the current BFS.
6 Vertex ordering and splitting over the GPU
As mentioned above (in the definition of a proper kBFS), each vertex is associated with a removal order. To achieve better performance, vertices are ordered in reverse order of the vertices’ degrees, i.e. the vertex with the highest degree was associated with the lowest index, and the vertex with the lowest degree was associated with the highest index. This way, the motifs for the vertices with the highest degrees are calculated at the beginning and these vertices are removed from the graph. Hence there is no repassing on these ”heavy” vertices.
Realworld networks often have scalefree degree distribution, and as such may be computationally expensive. VDMC can handle very high degrees through the division of the for high degree vertices into parallel computations. To parallelize the analysis, VDMC is adapted to GPU to analyze large blocks of vertices in parallel. Each pair of a vertex and one of its neighbors is computed separately. Dividing the vertices on the GPU even by their neighbors was done to make the parallelization more efficient by making the blocks’ tasks more equal. It prevents a situation where the algorithm waits only for a small number of vertices with a very high degree while the calculations for the other vertices have been completed.
The following description is slightly technical, and can be skipped:
Twodimensional blocks were created and arranged in a grid. Each block contained 16 threads for each dimension. The grid was also defined to be twodimensional in order to calculate each pair of vertex and its neighbor in a separate block. The dimensions of the grid are and each of them was limited to blocks (This number may be modified according to the datasets and the memory available on the GPU). For datasets in which the number of blocks was larger than the defined limit, the excess vertices were divided again starting from the first block. That is, the BFS for the ith vertex and its jth neighbor was calculated in the block whose indices are .This ensures the blocks’ tasks to be even and a maximal parallelization of the analysis over the GPU threads.
7 Comparison to Theory in
To test the accuracy of VDMC, we compared the observed and expected number of each motif in random graphs (also commonly named Erdős Rényi graphs). We define a as a graph with
vertices, where each pair of vertices is connected with a constant probability
, independent on all edges or vertices. Let be the number of of index that contain vertex . We compute the expected value of within an arbitrary ( ).Denote as the set of all combinations of vertices that contain vertex . For a specific combination of vertices , we denote the event ”The induced subgraph from is a motif of type ” as , the number of isomorphs of a motif of type as , and the number of edges in a motif of type as . The maximal number of edges within vertices is if the graph is undirected and twice that number otherwise. The indicator function is denoted as
can be represented as sum of events.
(1) 
leading to:
(2) 
The probability in the right hand side is independent on , since all edges are independent (up to the approximation of ignoring overlapping sets, which is a good approximation for large enough graphs), leading to:
(3) 
The probability that a specific motif appears in a specific set of vertices is just the sum of probabilities over all the isomorphs of , that the specific isomorph appears. This probability is equal for all isomorphs, since all edges are independent, leading to:
(4) 
The estimate of is immediate and can be obtained by directly counting permutations.
To test the fit between Eq. 4 and VDMC, we built for multiple and used , then compared our motif calculations to the theory. Figure 5 shows the log of the expected (internal bar) and observed (external bar) motif frequencies for graphs with 1,000 vertices. As one can see, the expected and observed values are equal (Chi square test non significant at the p=0.05 level for all motifs).
We have further performed extensive validations on both random graphs and small toygraphs where the frequency of each motif can be computed analytically (e.g. cliques, regular Directed Acyclic Graphs (DAG), etc. ) and VDMC is always accurate. All examples are detailed in the GIT.
8 Algorithm efficacy  simulated graphs
The cost of the proposed algorithm is for and for , which is proportional to the total number of motifs, where is the average number of third neighbors. For an Erdős Rényi graph, this is closely equivalent to . However, for fattailed distributions, this can be much higher. Note that the efficient implementation ensures that the constants in this analysis are not large. In practice, the computational cost is determined by the implementation. The computational cost of the same algorithm in C++ is approximately 10 times more efficient than its parallel in Python (Fig 4 and 5).
Moreover, following the massive parallelization, the cost of GPU based application is not sensitive to the number of vertices, and not to the average degree, as long as not all threads are filled. Thus, up to thousands of vertices and hundreds of thousands of edges, the cost of GPU based subgraph enumeration can be treated as constant. Even for larger realworld graphs, the cost of the GPU application is of reasonable cost, as further detailed. Note that there is an initial cost to accessing GPU. Thus, for small graphs, the computational cost of GPU based applications is in practice higher than CPU based applications.
9 Algorithm efficacy  real world networks
Simulated graphs typically differ from real world networks. Real world networks often have scale free degree distribution, and as such may be computationally expensive. VDMC can handle very high degrees through the division of the for high degree vertices into parallel computations, as mentioned in the GPU implementation. Several real datasets were used to test our algorithm and to compare it to previous algorithms. These datasets and their properties are summarized in table number 1. The current state of the art for local subgraph counting is through local homomorphism counting, and process local homomorphism counting by natural joins with groupby and aggregation in a distributed system DISC on top of Spark (see Zhang et al. (2020) for more details). The elapsed time for each algorithm for 3 and may be found in table number 2. VDMC is typically 510 time slower than DISC, but VDMC is run on a single Tesla V100SXM232GB card, while DISC is run on a 16 machine Spark system. Moreover, DISC only count undirected motifs, and VDMC counts directed motifs.
DatasetsProperties  Notation  —V—  —E—  Is Directed 


webBerkStan  WBD  True  
WB  False  
asSkitter  AS  False  
socLiveJournal  LJD  True  
LJ  False  
comOrkut  OK  False 
10 Other tools available in the current toolbox
The CSR format allows for efficient computation of multiple features, beyond the motif counting discussed here, we developed multiple other simpler measures using the same formalism, including:

KCores  the maximal subgraph, where each vertex has a degree of at least in the subgraph (Dorogovtsev et al. (2006)).

The normalized distance distribution for each vertex (i.e. the fraction of vertices with a distance of from a given vertex.

Attraction Bassin hierarchy (Muchnik et al. (2007)). Average neighbor degree

Page Rank (Page et al. (1999)).

Flow  a hierarchy measure that approximates topological sorting for graphs with cycles(Rosen and Louzoun (2014)).
All these measures are available in the Github with the motif counting algorithm.
11 Discussion
We have presented a highly efficient distributed algorithm to count vertex participation in 3 and 4 motifs, and shown the accuracy and efficacy of the algorithm. The memory cost is simply the number of edges, and the CPU cost is precisely the number of motifs counted since each motif is counted once and only once. This algorithm contains an efficient cache aware data structure. The same could be extended to counting motifs for edges, rather than vertices. This change is minimal and only requires updating edges and not vertices once a motif was counted. The proposed algorithm can also be easily distributed among different GPUs/CPUs, by simply sending chunks of vertices in the root of the BFS to different GPUs/CPUs. Finally, while we have proposed an algorithm for 3 and 4 motifs, the current claims and data structure are appropriate for 5 motifs too.
This algorithm has many possible applications, including among many others, subgraph isomorphism algorithms, topologybased graph machine learning, as well as descriptive tools for network analysis. As the networks studied grow exponentially, such tools are getting critical to enlarge existing network topology methods to larger graphs.
The main caveat of the current method is its limitation to a given motif size, in contrast with methods that have been developed to detect any specific subgraph ( Kashtan et al. (2004); Kashani et al. (2009); Wernicke and Rasche (2006); Melckenbeeck et al. (2018); Song et al. (2015); Koskas et al. (2011)). However, the enumeration of all subgraph a given size requires only one pass on each set of vertices that can compose such a motif. As such, the current code is typically as efficient for all motifs as for the task of finding a specific subgraph of a given size.
References
 (1)
 Ahmed et al. (2015) Nesreen K Ahmed, Jennifer Neville, Ryan A Rossi, and Nick Duffield. 2015. Efficient graphlet counting for large networks. In 2015 IEEE International Conference on Data Mining. IEEE, 1–10.
 Ahmed et al. (2016a) Nesreen K Ahmed, Theodore L Willke, and Ryan A Rossi. 2016a. Estimation of local subgraph counts. In 2016 IEEE International Conference on Big Data (Big Data). IEEE, 586–595.
 Ahmed et al. (2016b) Nesreen K Ahmed, Theodore L Willke, and Ryan A Rossi. 2016b. Exact and estimation of local edgecentric graphlet counts. In Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. 1–17.
 Akoglu et al. (2015) Leman Akoglu, Hanghang Tong, and Danai Koutra. 2015. Graph based anomaly detection and description: a survey. Data mining and knowledge discovery 29, 3 (2015), 626–688.
 Benami et al. (2019) Idan Benami, Keren Cohen, Oved Nagar, and Yoram Louzoun. 2019. Topological based classification of paper domains using graph convolutional networks. arXiv preprint arXiv:1904.07787 (2019).
 Bressan et al. (2018) Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. 2018. Motif counting beyond five nodes. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 4 (2018), 1–25.
 Bron and Kerbosch (1973) Coen Bron and Joep Kerbosch. 1973. Algorithm 457: finding all cliques of an undirected graph. Commun. ACM 16, 9 (1973), 575–577.
 Buluç et al. (2009) Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, and Charles E Leiserson. 2009. Parallel sparse matrixvector and matrixtransposevector multiplication using compressed sparse blocks. In Proceedings of the twentyfirst annual symposium on Parallelism in algorithms and architectures. 233–244.
 Dave et al. (2017) Vachik S Dave, Nesreen K Ahmed, and Mohammad Al Hasan. 2017. ECLoG: counting edgecentric local graphlets. In 2017 IEEE International Conference on Big Data (Big Data). IEEE, 586–595.
 Dorogovtsev et al. (2006) Sergey N Dorogovtsev, Alexander V Goltsev, and Jose Ferreira F Mendes. 2006. Kcore organization of complex networks. Physical review letters 96, 4 (2006), 040601.
 Eddin and Ribeiro (2017) Ahmad Naser Eddin and Pedro Ribeiro. 2017. Scalable subgraph counting using MapReduce. In Proceedings of the Symposium on Applied Computing. 1574–1581.
 Elenberg et al. (2015) Ethan R Elenberg, Karthikeyan Shanmugam, Michael Borokhovich, and Alexandros G Dimakis. 2015. Beyond triangles: A distributed framework for estimating 3profiles of large graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 229–238.
 Elenberg et al. (2016) Ethan R Elenberg, Karthikeyan Shanmugam, Michael Borokhovich, and Alexandros G Dimakis. 2016. Distributed estimation of graph 4profiles. In Proceedings of the 25th International Conference on World Wide Web. 483–493.
 Elhesha and Kahveci (2016) Rasha Elhesha and Tamer Kahveci. 2016. Identification of large disjoint motifs in biological networks. BMC bioinformatics 17, 1 (2016), 408.
 Erdős and Rényi (1960) Paul Erdős and Alfréd Rényi. 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5, 1 (1960), 17–60.
 Gonen et al. (2011) Mira Gonen, Dana Ron, and Yuval Shavitt. 2011. Counting stars and other small subgraphs in sublineartime. SIAM Journal on Discrete Mathematics 25, 3 (2011), 1365–1411.
 Hayes et al. (2013) Wayne Hayes, Kai Sun, and Nataša Pržulj. 2013. Graphletbased measures are suitable for biological network comparison. Bioinformatics 29, 4 (2013), 483–491.
 Hočevar and Demšar (2014) Tomaž Hočevar and Janez Demšar. 2014. A combinatorial approach to graphlet counting. Bioinformatics 30, 4 (2014), 559–565.
 Hočevar and Demšar (2017) Tomaž Hočevar and Janez Demšar. 2017. Combinatorial algorithm for counting small induced graphs and orbits. PloS one 12, 2 (2017), e0171428.
 Itzhack et al. (2007) Royi Itzhack, Yelena Mogilevski, and Yoram Louzoun. 2007. An optimal algorithm for counting network motifs. Physica A: Statistical Mechanics and its Applications 381 (2007), 482–490.

Jin et al. (2010)
Ning Jin, Calvin Young,
and Wei Wang. 2010.
GAIA: graph classification using evolutionary computation. In
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 879–890.  Kashani et al. (2009) Zahra Razaghi Moghadam Kashani, Hayedeh Ahrabian, Elahe Elahi, Abbas NowzariDalini, Elnaz Saberi Ansari, Sahar Asadi, Shahin Mohammadi, Falk Schreiber, and Ali MasoudiNejad. 2009. Kavosh: a new algorithm for finding network motifs. BMC bioinformatics 10, 1 (2009), 318.
 Kashtan et al. (2004) Nadav Kashtan, Shalev Itzkovitz, Ron Milo, and Uri Alon. 2004. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20, 11 (2004), 1746–1758.
 Koskas et al. (2011) Michel Koskas, Gilles Grasseau, Etienne Birmelé, Sophie Schbath, and Stéphane Robin. 2011. NeMo: Fast count of network motifs. Book of Abstracts for Journées Ouvertes Biologie Informatique Mathématiques (JOBIM) 2011 (2011), 53–60.
 Lin et al. (2016) Wenqing Lin, Xiaokui Xiao, Xing Xie, and XiaoLi Li. 2016. Network motif discovery: A GPU approach. IEEE transactions on knowledge and data engineering 29, 3 (2016), 513–528.
 Liu et al. (2009) Yang Liu, Xiaohong Jiang, Huajun Chen, Jun Ma, and Xiangyu Zhang. 2009. Mapreducebased pattern finding algorithm applied in motif detection for prescription compatibility network. In International Workshop on Advanced Parallel Processing Technologies. Springer, 341–355.
 Melckenbeeck et al. (2018) Ine Melckenbeeck, Pieter Audenaert, Didier Colle, and Mario Pickavet. 2018. Efficiently counting all orbits of graphlets of any order in a graph using autogenerated equations. Bioinformatics 34, 8 (2018), 1372–1380.
 Melckenbeeck et al. (2016) Ine Melckenbeeck, Pieter Audenaert, Tom Michoel, Didier Colle, and Mario Pickavet. 2016. An algorithm to automatically generate the combinatorial orbit counting equations. PloS one 11, 1 (2016), e0147078.
 Milinković et al. (2015) Aleksandar Milinković, Stevan Milinković, and Ljubomir Lazić. 2015. A contribution to acceleration of graphlet counting. In Infoteh Jahorina Symposium, Vol. 14. 741–745.
 Muchnik et al. (2007) Lev Muchnik, Royi Itzhack, Sorin Solomon, and Yoram Louzoun. 2007. Selfemergence of knowledge trees: Extraction of the Wikipedia hierarchies. Physical review E 76, 1 (2007), 016106.
 Naaman et al. (2019) Roi Naaman, Keren Cohen, and Yoram Louzoun. 2019. Edge sign prediction based on a combination of network structural topology and sign propagation. Journal of Complex Networks 7, 1 (2019), 54–66.
 Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report. Stanford InfoLab.
 Papadimitriou et al. (2010) Panagiotis Papadimitriou, Ali Dasdan, and Hector GarciaMolina. 2010. Web graph similarity for anomaly detection. Journal of Internet Services and Applications 1, 1 (2010), 19–30.
 Park et al. (2016) HaMyung Park, SungHyon Myaeng, and U Kang. 2016. Pte: Enumerating trillion triangles on distributed systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1115–1124.
 Pashanasangi and Seshadhri (2020) Noujan Pashanasangi and C Seshadhri. 2020. Efficiently counting vertex orbits of all 5vertex subgraphs, by evoke. In Proceedings of the 13th International Conference on Web Search and Data Mining. 447–455.
 Pržulj (2007) Nataša Pržulj. 2007. Biological network comparison using graphlet degree distribution. Bioinformatics 23, 2 (2007), e177–e183.
 Ribeiro et al. (2019) Pedro Ribeiro, Pedro Paredes, Miguel EP Silva, David Aparicio, and Fernando Silva. 2019. A Survey on Subgraph Counting: Concepts, Algorithms and Applications to Network Motifs and Graphlets. arXiv preprint arXiv:1910.13011 (2019).
 Ribeiro et al. (2010a) Pedro Ribeiro, Fernando Silva, and Luis Lopes. 2010a. Efficient parallel subgraph counting using gtries. In 2010 IEEE International Conference on Cluster Computing. IEEE, 217–226.
 Ribeiro et al. (2010b) Pedro Manuel Pinto Ribeiro, Fernando MA Silva, and Luís MB Lopes. 2010b. Parallel Calculation of Subgraph Census in Biological Networks.. In Bioinformatics. 56–65.
 Rosen and Louzoun (2014) Yonatan Rosen and Yoram Louzoun. 2014. Directionality of real world networks as predicted by path length in directed and undirected graphs. Physica A: Statistical Mechanics and its Applications 401 (2014), 118–129.

Rossi et al. (2012)
Ryan A Rossi, Luke K
McDowell, David William Aha, and
Jennifer Neville. 2012.
Transforming graph data for statistical relational
learning.
Journal of Artificial Intelligence Research
45 (2012), 363–441.  Rossi and Zhou (2016) Ryan A Rossi and Rong Zhou. 2016. Leveraging multiple gpus and cpus for graphlet counting in large networks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 1783–1792.
 Schank and Wagner (2005) Thomas Schank and Dorothea Wagner. 2005. Finding, counting and listing all triangles in large graphs, an experimental study. In International workshop on experimental and efficient algorithms. Springer, 606–609.
 Schatz et al. (2008) Michael Schatz, Elliott CooperBalis, and Adam Bazinet. 2008. Parallel network motif finding. Techinical report, University of Maryland Insitute for Advanced Computer Studies (2008).
 Song et al. (2015) Xiaoli Song, Changjun Zhou, Bin Wang, and Qiang Zhang. 2015. A Method of Motif Mining Based on Backtracking and Dynamic Programming. In International Workshop on Multidisciplinary Trends in Artificial Intelligence. Springer, 317–328.
 Tsourakakis (2015) Charalampos Tsourakakis. 2015. The kclique densest subgraph problem. In Proceedings of the 24th international conference on world wide web. 1122–1132.
 Ullmann (1976) Julian R Ullmann. 1976. An algorithm for subgraph isomorphism. Journal of the ACM (JACM) 23, 1 (1976), 31–42.
 Wang et al. (2017) Pinghui Wang, Junzhou Zhao, Xiangliang Zhang, Zhenguo Li, Jiefeng Cheng, John CS Lui, Don Towsley, Jing Tao, and Xiaohong Guan. 2017. MOSS5: A fast method of approximating counts of 5node graphlets in large graphs. IEEE Transactions on Knowledge and Data Engineering 30, 1 (2017), 73–86.
 Wang et al. (2005) Tie Wang, Jeffrey W Touchman, Weiyi Zhang, Edward B Suh, and Guoliang Xue. 2005. A parallel algorithm for extracting transcriptional regulatory network motifs. In Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE’05). IEEE, 193–200.
 Wang et al. (2019) Zhaokang Wang, Rong Gu, Weiwei Hu, Chunfeng Yuan, and Yihua Huang. 2019. BENU: Distributed Subgraph Enumeration with BacktrackingBased Framework. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 136–147.
 Wernicke and Rasche (2006) Sebastian Wernicke and Florian Rasche. 2006. FANMOD: a tool for fast network motif detection. Bioinformatics 22, 9 (2006), 1152–1153.
 Yang et al. (2018) Chen Yang, Min Lyu, Yongkun Li, Qianqian Zhao, and Yinlong Xu. 2018. SSRW: a scalable algorithm for estimating graphlet statistics based on random walk. In International Conference on Database Systems for Advanced Applications. Springer, 272–288.
 Yin et al. (2018) Hao Yin, Austin R Benson, and Jure Leskovec. 2018. Higherorder clustering in networks. Physical Review E 97, 5 (2018), 052306.
 Zhang et al. (2020) Hao Zhang, Jeffrey Xu Yu, Yikai Zhang, Kangfei Zhao, and Hong Cheng. 2020. Distributed Subgraph Counting: A General Approach. Proc. VLDB Endow. 13, 12 (July 2020), 2493–2507. https://doi.org/10.14778/3407790.3407840
12 Appendix I  GPU Changes
To further accelerate the motif counting code, a GPU based version was developed using the CUDA library. In the parallel version, the frequencies of all motifs for each are computed in parallel. This requires running each of those calculations independently of the others. Given the lemmas proved above, each can be computed separately.
The following modifications were required for the GPU implementation:

GPU functions are written outside the main bodies of code (MotifCalculator,MotifUtils, CacheGraph), as they cannot be member functions. They are instead rewritten as standalone functions within the file.

Global variables are used to communicate between the class methods, which provide the CPUside pre and postprocessing of the results, and the GPU functions.

Atomic add is used to update the motif counters, so that the GPU threads won’t interfere with one another.

All data which is used in the GPU must be copied into GPU memory. Additionally, the data is prefetched asynchronously to improve the code’s performance.
To run the GPU code, some requirements must be met, in both software and hardware.

The GPU must be an NVIDIA GPU of computing capability 3.5 or higher.

Version 8.0 or higher of the CUDA Toolkit must be installed and in the PATH of the system.

The GPU drivers must be compatible with the CUDA Toolkit.
Specifically, the code was tested on a system with the following specs:

Ubuntu 20.04.2 LTS Linux distribution

A Tesla V100SXM232GB with compute capability 7.0

Each of the following CUDA Toolkit versions: 8.0, 9.2, 10.1, 10.2

GPU driver versions 396.26 and 440.33.01