1 Introduction
Subgraph analysis in massive graphs is a fundamental task that arises in numerous applications, including social network analysis [1], uncovering network motifs (repetitive subgraphs) in gene regulatory networks in bioinformatics [2], indexing graph databases [3], optimizing task scheduling in infrastructure monitoring, and detecting events in cybersecurity. Many emerging applications require solving the subgraph counting problem on very large instances.
Given two graphs—a template $H$ on $k$ vertices and an input graph $G$ on $n$ vertices and $m$ edges—some of the commonly studied questions in subgraph analysis include: 1) Subgraph existence: determining whether $G$ contains a subgraph that is isomorphic to the template $H$; 2) Subgraph counting: counting the number of such subgraphs; 3) Frequent subgraphs: finding subgraphs that occur frequently in $G$; and 4) Graphlet frequency: computing the Graphlet Frequency Distribution (GFD) of a set of templates, i.e., for each template, counting the number of its occurrences in $G$. Some of the commonly studied templates are paths and trees, and we focus on the detection and counting versions of the non-induced subgraph isomorphism problem for paths and trees (these problems are formally defined in Section 2). Tree template counting can also be used as a kernel to estimate the GFD in a graph. For instance, Bressan et al. [4] show that a well-implemented tree template counting kernel can push the limit of the state-of-the-art GFD computation in terms of input graph size and template size. These problems are NP-hard even for very simple templates, and the best known algorithms for computing exact counts of a $k$-vertex template in an $n$-vertex graph have running times exponential in $k$ [5]. This motivates the use of approximation algorithms, and several techniques have been developed for subgraph counting. These are based on the idea of fixed-parameter tractable algorithms, whose execution time is exponential in the template size but polynomial in the number of vertices $n$; this is one of the standard approaches for dealing with NP-hard problems (see [6, 7]). Two powerful classes of techniques developed for subgraph counting are: 1) color-coding [8], the first fixed-parameter technique for counting paths and trees, whose running time and space are both exponential in the template size $k$; and 2) multilinear detection [9, 10], which is based on a reduction of the subgraph detection problem to detecting multilinear terms in multivariate polynomials, and which reduces the time to $O^*(2^k)$ and the space to polynomial. However, finding the actual subgraphs requires additional work [11].
Our focus is on parallel algorithms and implementations for subgraph detection and counting. Parallel versions of both the color-coding [12, 13] and multilinear detection techniques have been developed [14]. Although the multilinear detection technique has several benefits in time and space over color-coding, it is more involved when it comes to finding the subgraph embeddings. Additionally, the color-coding technique has been extended to subgraphs beyond trees, specifically those with treewidth more than 1—such subgraphs are much more complex than trees and can contain cycles [15]. Therefore, efficient parallelization of the color-coding technique remains a very useful objective, and it is the focus of our work.
The current parallel algorithms for color-coding have been implemented either with MapReduce (SAHAD [12]) or with MPI (FASCIA [13]). However, both methods suffer from significant communication overhead and large memory footprints, which prevent them from scaling to templates with more than 12 vertices. We focus on the problem of counting tree templates (referred to as treelets), identify the bottlenecks of scaling, and design a new approach for parallelizing color-coding. We aim to address the following computational challenges:

Communication: Many graph applications rely on point-to-point communication and lack a high-level communication abstraction that adapts to irregular graph interactions.

Load balance: The sparsity of the input graph leads to computational load imbalance.

Memory: The high volume of intermediate data generated by large subgraph templates (big models) causes high peak memory utilization within a node at runtime.
We investigate computing capabilities to run subgraph counting at a very large scale, and we propose the following solutions:

AdaptiveGroup communication with a data-driven pipeline design to interleave computation with communication.

Neighbor-list partitioning for fine-grained task granularity to alleviate load imbalance at the thread level within a single node.

Intermediate-data partitioning, with sliced and overlapped workloads in the pipeline, to reduce peak memory utilization.
We compare our results with the state-of-the-art MPI-based FASCIA implementation [13] and demonstrate the applicability of the proposed method by counting large treelets (up to 15 vertices) on massive graphs (up to 5 billion edges and 66 million vertices).
The rest of the chapter is organized as follows. Section 2 introduces the problem, the color-coding algorithm, and its scaling challenges. Section 3 presents our approach to AdaptiveGroup communication as well as a neighbor-list partitioning technique at the thread level. Section 4 contains an experimental analysis of the proposed methods and the performance improvements. Related work and our conclusions can be found in Sections 5 and 6.
2 Background
Let $G=(V,E)$ denote a graph on a set $V$ of nodes and a set $E$ of edges. We say that a graph $G'=(V',E')$ is a non-induced subgraph of $G$ if $V'\subseteq V$ and $E'\subseteq E$; note that, unlike in an induced subgraph, $G$ may contain additional edges among the nodes of $V'$. A template graph $H=(V_H,E_H)$ is said to be isomorphic to a non-induced subgraph $G'=(V',E')$ of $G$ if there exists a bijection $\phi: V_H \rightarrow V'$ such that for each edge $(u,v)\in E_H$, we have $(\phi(u),\phi(v))\in E'$. In this case, we also say that $G'$ is a non-induced embedding of $H$.
Subgraph counting problem. Given a graph $G$ and a template $H$, our objective is to count the number of non-induced embeddings of $H$ in $G$, which we denote by $\mathrm{emb}(H,G)$.
Given $\varepsilon, \delta \in (0,1)$, we say that a randomized algorithm with output $Z$ gives an $(\varepsilon,\delta)$-approximation if $\Pr\big[|Z-\mathrm{emb}(H,G)| \le \varepsilon\,\mathrm{emb}(H,G)\big] \ge 1-\delta$. In this chapter, we focus on obtaining such an approximation to $\mathrm{emb}(H,G)$.
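As a concrete illustration of the definition above, the following minimal Python sketch (with hypothetical helper names, not part of the chapter's implementation) checks whether a candidate bijection is a non-induced embedding: every template edge must map to a graph edge, while extra edges among the image vertices are allowed.

```python
def is_noninduced_embedding(template_edges, graph_edges, phi):
    """Check that the bijection phi maps every template edge to a graph
    edge. Extra graph edges among the image vertices are permitted,
    which is exactly the non-induced notion of embedding."""
    g = {frozenset(e) for e in graph_edges}  # undirected edge set
    return all(frozenset((phi[u], phi[v])) in g for u, v in template_edges)
```

For example, a 3-vertex path embeds (non-induced) into a triangle even though the triangle has an extra edge among the mapped vertices.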
2.1 The Color-coding Technique
Color-coding is a randomized approximation algorithm that estimates the number of tree-like embeddings in $O(c^k m)$ time, where $k$ is the tree size and $c$ is a constant. We briefly describe the key ideas of the color-coding technique here, since our algorithm involves a parallelization of it.
Counting colorful embeddings. The main idea is that if we assign a color to each node $v \in V$, "colorful" embeddings, namely those in which each node has a distinct color, can be counted easily in a bottom-up manner.
For a tree template $T$, let $\rho$ denote its root, which can be picked arbitrarily; $T(\rho)$ denotes the template rooted at $\rho$. Let $T_1$ and $T_2$ denote the subtrees obtained by cutting an edge $(\rho,\tau)$ from $T$; we pick $\rho$ as the root of $T_1$ and $\tau$ as the root of $T_2$. Let $\mathrm{cnt}(T(\rho), v, C)$ denote the number of colorful embeddings of $T$ with vertex $\rho$ mapped to node $v$, using the color set $C$, where $|C| = |T|$. Then we can compute $\mathrm{cnt}(T(\rho), v, C)$ using dynamic programming with the following recurrence:
(1) $\mathrm{cnt}(T(\rho), v, C) = \frac{1}{d}\sum_{u\in N(v)}\ \sum_{\substack{C_1\cup C_2=C\\ C_1\cap C_2=\emptyset}} \mathrm{cnt}(T_1(\rho), v, C_1)\cdot \mathrm{cnt}(T_2(\tau), u, C_2)$
where $d$ accounts for the overcounting, as discussed in [8].
Figure 1 (a) shows how the problem is decomposed into smaller subproblems. In this partitioning process, an arbitrary vertex is picked as the root (marked in red); then one of its edges is removed, splitting the tree into two smaller subtrees. The arrows denote these split relationships, with the solid line pointing to the subtree containing the root vertex and the dotted line to the other. This process runs recursively until each sub-template has only one vertex. Figure 1 (b) shows an example of the colorful embedding counting process, demonstrating the calculation on one neighbor of the root vertex. Here, the tree template is split into two sub-templates; to count the embeddings of the template rooted at a vertex $v$ under a color set $C$, we enumerate all valid splits of $C$ into sub-color-sets for the two sub-templates. Since the template can be constructed by combining the two subtrees, its count equals the sum, over all such splits, of the products of the subtree counts. In this example, the combination of the two subtrees uniquely locates a colorful embedding; for some templates, however, the subtrees are isomorphic to each other once the root is removed. E.g., for the template in Figure 1 (a), the same embedding is counted twice in the dynamic programming process; the overcounting factor in Equation 1 corrects for this.
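The bottom-up counting idea can be sketched in Python for the special case of path templates, where each DP state is an endpoint vertex together with the set of colors used so far (a simplified illustration with hypothetical names, not the chapter's optimized kernel; the undirected double-count, removed by a final division by two, plays the role of the overcounting factor):

```python
def count_colorful_paths(adj, colors, k):
    """Color-coding DP for path templates (k >= 2): dp maps an endpoint
    vertex to {frozenset of colors used: number of colorful paths
    ending at that vertex with exactly those colors}."""
    n = len(adj)
    dp = {v: {frozenset([colors[v]]): 1} for v in range(n)}
    for _ in range(k - 1):
        ndp = {v: {} for v in range(n)}
        for v in range(n):
            for u in adj[v]:
                for used, cnt in dp[u].items():
                    if colors[v] not in used:  # extend only if still colorful
                        key = used | {colors[v]}
                        ndp[v][key] = ndp[v].get(key, 0) + cnt
        dp = ndp
    # every undirected path is discovered once from each of its two endpoints
    return sum(c for table in dp.values() for c in table.values()) // 2
```

On a triangle whose three vertices carry three distinct colors, the DP finds all three colorful 3-vertex paths; because colorfulness forces distinct vertices, no explicit simplicity check is needed.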
Random coloring. The second idea is that if the coloring is done randomly with $k$ colors, there is a reasonable probability that an embedding is colorful, i.e., that each of its nodes receives a distinct color. Specifically, an embedding of $H$ is colorful with probability
(2) $P = \frac{k!}{k^k}$.
Therefore, the expected number of colorful embeddings is $P\cdot \mathrm{emb}(H,G)$, where $\mathrm{emb}(H,G)$ is the total number of embeddings of $H$ in $G$. Alon et al. [8] show that this estimator has bounded variance, so averaging the colorful counts $C_j$ observed over $N$ independent random colorings yields an efficient estimate of the number of embeddings:
(3) $\widehat{\mathrm{emb}}(H,G) = \frac{1}{N}\sum_{j=1}^{N} \frac{C_j}{P}$.
Algorithm 1 describes the sequential color-coding algorithm.
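A sketch of the full estimator under random coloring: each iteration colors the graph with $k$ colors uniformly at random, counts the colorful embeddings (brute force here, suitable only for a tiny demo graph), and rescales by the colorful probability $k!/k^k$. Helper names are hypothetical; the real implementation counts colorful embeddings with the dynamic program of Equation 1.

```python
import random
from math import factorial
from itertools import permutations

def colorful_paths(adj_set, colors, k):
    """Brute-force count of colorful k-vertex paths (tiny demo graphs only)."""
    cnt = 0
    for seq in permutations(range(len(adj_set)), k):
        ok_path = all(seq[i + 1] in adj_set[seq[i]] for i in range(k - 1))
        if ok_path and len({colors[v] for v in seq}) == k:
            cnt += 1
    return cnt // 2  # each undirected path appears in both directions

def estimate_paths(adj_set, k, n_iter, rng):
    """Average the colorful counts over n_iter random colorings and
    rescale each by the colorful probability P = k!/k^k."""
    p = factorial(k) / k ** k
    total = 0.0
    for _ in range(n_iter):
        colors = [rng.randrange(k) for _ in range(len(adj_set))]
        total += colorful_paths(adj_set, colors, k) / p
    return total / n_iter
```

For a triangle (true count of 3-vertex paths: 3), the estimate concentrates near 3 as the number of iterations grows, illustrating the bounded-variance property.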
Distributed color-coding and challenges. Since color-coding runs independent estimates in the outer loop at line 4 of the sequential Algorithm 1, it is straightforward to parallelize that outer loop. However, if a large dataset cannot fit into the memory of a single node, the algorithm must partition the dataset over multiple nodes and parallelize the inner loop at line 8 of Algorithm 1 to exploit the computational power of more cluster nodes. In that case, the vertices partitioned to each local node require the count information of their neighbors located on remote cluster nodes, which introduces communication overhead and compromises scaling efficiency. Algorithm 2 uses a collective all-to-all operation to communicate count information among processes and updates the counts of local vertices at line 16. This standard communication pattern ignores the impact of growing template size, which exponentially increases the communication cost and reduces the parallel scaling efficiency. Moreover, the skewed distribution of neighbor vertices across cluster nodes generally causes workload imbalance among processes and produces a "straggler" that slows down the collective communication operation. Finally, it requires each local node to hold all the transferred count information in memory before starting the computation stage on the remote data, resulting in high peak memory utilization on a single cluster node, which becomes a bottleneck when scaling out the distributed color-coding algorithm.
3 Scaling Distributed Color-coding
To address the challenges analyzed in Section 2, we propose a novel node-level communication scheme named AdaptiveGroup in Section 3.2, and a fine-grained thread-level optimization called neighbor-list partitioning in Section 3.3. Both approaches are implemented as a subgraph counting application in our open source project HarpDAAL [16][17].
3.1 HarpDAAL
HarpDAAL is an ongoing effort to run data-intensive workloads on HPC clusters as part of a High Performance Computing Enhanced Apache Big Data Stack (HPC-ABDS). Extending Apache Hadoop, HarpDAAL provides users with an MPI-like programming model besides the default MapReduce paradigm. Unlike Apache Hadoop, HarpDAAL stores intermediate results in main memory rather than on disk, and it implements a variety of collective communication operations optimized for data-intensive machine learning workloads [18][19][20]. Furthermore, HarpDAAL provides hardware-specific acceleration via an integration of the Intel Data Analytics Acceleration Library (Intel DAAL) [21] for intra-node computation workloads. Intel DAAL is an open source project, and we contribute the optimization codes of this work as reusable kernels of Intel DAAL.
3.2 AdaptiveGroup Communication
AdaptiveGroup is an inter-process communication scheme based on the concept of a communication group. Given $p$ parallel processes, each process belongs to a communication group within which it has data dependencies, i.e., it sends data to and receives data from the other processes of the group. In an all-to-all operation such as MPI_Alltoall, each process communicates data with all other processes in a collective way; in other words, all processes form a single communication group of size $p$. In AdaptiveGroup communication, the collective communication is divided into multiple steps, and at each step a process communicates only with the processes of a smaller communication group. The group size and the total number of steps are both configurable on-the-fly and adapt to the computation overhead, load balance, and memory utilization of irregular problems like subgraph counting.
A routing method is required to guarantee that no missing or redundant data transfers occur across the steps. Figure 2 illustrates such a routing method, where the all-to-all operation among 5 processes is decoupled into 4 steps, and at each step each process communicates with only two other processes within a communication group of size 3. Lines 3 to 14 of Algorithm 3 give the pseudocode of AdaptiveGroup communication implementing the routing method of Figure 2. The communication is adaptive to the template size: for a large template, the algorithm adopts the routing method of Figure 2 with a communication group size of 3, while it switches to the traditional all-to-all operation if the template is small.
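The group-based routing with group size 3 is shown in Figure 2. As a simpler illustration of the same idea—decomposing an all-to-all into a fixed number of smaller steps with no missing or redundant transfers—the classic shifted-ring schedule splits an all-to-all among $p$ processes into $p-1$ pairwise steps (a sketch of the general technique; this is not the exact routing of Figure 2):

```python
def pairwise_schedule(p):
    """Decompose an all-to-all among p processes into p - 1 steps: in
    step i, process q sends to (q + i) % p and receives from (q - i) % p,
    so every ordered pair of processes communicates exactly once."""
    return [[(q, (q + i) % p) for q in range(p)] for i in range(1, p)]
```

Within each step, every process appears exactly once as a sender and once as a receiver, so no process is idle, and across all steps each process exchanges data with every other process exactly once.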
3.2.1 Pipeline Design
Across the steps of AdaptiveGroup, we apply the pipeline design shown in Figure 3, which consists of a computation pipeline (red) and a communication pipeline (blue). For an AdaptiveGroup communication with a given number of steps, each pipeline goes through a corresponding sequence of stages to finish all the work. The first stage is a cold start: no previously received data exists in the computation pipeline, and only the communication pipeline is transferring data. In the following stages, the work of the communication pipeline is interleaved with the work of the computation pipeline. This interleaving is achieved with a multithreaded programming model, where a single thread is in charge of the communication pipeline and the remaining threads are assigned to the computation pipeline (see Algorithm 3, lines 5 to 13). Since at each stage the computation pipeline relies on the data received in the previous stage of the communication pipeline, the two pipelines must synchronize at the end of each stage (shown as a dashed line in Figure 3). The additional performance brought by the pipeline depends on the fraction of communication that is overlapped by computation in each stage. We next estimate bounds on the computation and communication of the pipeline design for large templates through a complexity analysis.
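The two-pipeline interleaving can be sketched with one dedicated communication thread feeding a bounded queue while the caller's thread computes on the previously received step (a minimal sketch with hypothetical names; the actual implementation assigns many compute threads per stage):

```python
import threading
import queue

def pipelined_steps(recv_step, compute, n_steps):
    """Overlap 'communication' (recv_step) with computation: a dedicated
    thread prefetches the data of step s + 1 while the caller's thread
    computes on the data of step s."""
    q = queue.Queue(maxsize=1)  # bounded: comm stays at most one step ahead

    def comm_thread():
        for s in range(n_steps):
            q.put(recv_step(s))  # simulate receiving the data of step s
        q.put(None)              # sentinel: no more steps

    t = threading.Thread(target=comm_thread)
    t.start()
    results = []
    while True:
        data = q.get()
        if data is None:
            break
        results.append(compute(data))
    t.join()
    return results
```

The bounded queue acts as the per-stage synchronization point: the compute side blocks until the data of the next step has arrived, and the communication side cannot run more than one step ahead, mirroring the dashed synchronization lines in Figure 3.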
3.2.2 Complexity Analysis
When computing a sub-template $s$, we estimate the computation complexity on the remote neighbors at step $i$ as:
(4) 
where $k$ is the number of colors, $|s|$ is the size of sub-template $s$ of template $T$, and $s_1$ is the subtree partitioned from $s$ according to Algorithm 1.
We divide the neighbors $N(v)$ of a vertex $v$ into local neighbors $N^{loc}(v)$ and remote neighbors $N^{rmt}(v)$; the remote neighbors are made up of the neighbors received at each step. Under the assumption of random partitioning of vertices across the $p$ processes,
(5) 
where $m$ is the number of edges. Further, by applying a Chernoff bound, the number of remote neighbors concentrates around this expectation with high probability.
Therefore, we obtain the following bound on the computation:
(6) 
Similarly, the expected peak memory utilization at step $i$ is
(7) 
where the additional factor is the length of the array (memory space) that holds the combinations of color counts for each sub-template; this length is bounded as in line 8 of Algorithm 1.
The communication complexity at step $i$ under the Hockney model [22] is
(8) 
where $\alpha$ is the latency associated with the operations of step $i$, $\beta$ is the data transfer time per byte, and the last term is the time a process spends waiting for other processes because of load imbalance at step $i$; it is bounded by
(9) 
where $T_{q,i}$ is the execution time of process $q$ at step $i$, expressed as
(10) 
For the total complexity over all steps, we assume the routing algorithm described in Figure 2 is used, in which the number of steps for $p$ processes is $p-1$. We obtain the bound for computation as
(11) 
while the peak memory utilization is
(12) 
The total communication overhead of the pipeline design over all steps is
(13) 
where the overlap ratio is defined as the fraction of the communication time in a pipeline step that is effectively overlapped by computation:
(14) 
The computation per neighbor is bounded by the computation complexity, while the communication data volume per neighbor is bounded by the memory space complexity, and the former grows faster than the latter with respect to the template size. Therefore, for large templates, the computation term is generally larger than the communication overhead at each step, and the overlap ratio approaches one. Equation 13 is then bounded by
(15) 
With a large number of processes $p$, the dominant term
(16) 
is inversely proportional to $p$. The third term in Equation 15 is also inversely proportional to $p$. Therefore, the total communication overhead decreases with increasing $p$, which implies that the algorithm is scalable for large templates by bounding the communication overhead.
For small templates, there is usually not enough computation to hide the communication overhead, which yields a relatively small overlap ratio in Equation 13 and compromises the effectiveness of the pipeline interleaving. Worse, since the data transferred at each step is small, the pipeline cannot effectively utilize the interconnect bandwidth compared to the all-to-all operation. In such cases, AdaptiveGroup switches back to all-to-all mode to ensure good performance.
3.2.3 Implementation
We implement the pipelined AdaptiveGroup communication with Harp, where a mapper plays the role of a parallel process and mappers can perform various collective communication operations optimized for big data problems. In an implementation like MPI_Alltoall, each of the $p$ processes prepares a slot Slot(q) for every other process $q$ it communicates with, and pushes the data required by $q$ into Slot(q) prior to the collective communication. The ID labels of sender and receiver are attached to the slots statically, and the program must choose the type of collective operation (e.g., all-to-all, allgather) at compilation time.
In contrast, each Harp mapper keeps a sending queue and a receiving queue, and each data packet is labeled with a meta ID, as shown in Figure 4. For AdaptiveGroup, the meta ID of each packet consists of three parts (bitwise-packed into a 32-bit integer): the sender mapper ID, the receiver mapper ID, and the offset position in the sending queue. A user-defined routing algorithm then decodes the meta ID and delivers the packet in a dynamically configurable way. The routing algorithm detects the template and workload sizes and switches on-the-fly between the pipeline and all-to-all modes.
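A sketch of the meta-ID encoding: the chapter states that the sender ID, receiver ID, and sending-queue offset are bitwise-packed into one 32-bit integer but does not give the bit layout, so the 12/12/8-bit split below is an assumption for illustration only.

```python
SENDER_BITS, RECEIVER_BITS, OFFSET_BITS = 12, 12, 8  # assumed layout (12+12+8 = 32)

def pack_meta(sender, receiver, offset):
    """Pack (sender ID, receiver ID, queue offset) into one 32-bit int."""
    assert 0 <= sender < 1 << SENDER_BITS
    assert 0 <= receiver < 1 << RECEIVER_BITS
    assert 0 <= offset < 1 << OFFSET_BITS
    return (sender << (RECEIVER_BITS + OFFSET_BITS)) | (receiver << OFFSET_BITS) | offset

def unpack_meta(meta):
    """Inverse of pack_meta: recover the three fields from the meta ID."""
    sender = meta >> (RECEIVER_BITS + OFFSET_BITS)
    receiver = (meta >> OFFSET_BITS) & ((1 << RECEIVER_BITS) - 1)
    offset = meta & ((1 << OFFSET_BITS) - 1)
    return sender, receiver, offset
```

Packing the routing metadata into a single integer keeps the per-packet header compact and lets the user-defined routing algorithm decode sender, receiver, and queue position with a few shifts and masks.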
3.3 Finegrained Load Balance
For an input graph with highly skewed vertex out-degrees, color-coding suffers from a load imbalance issue at the thread level. In Algorithms 1 and 2, the task of computing the counts of a vertex by looping over all entries of its neighbor list is assigned to a single thread. If the maximum degree of the input graph is several orders of magnitude larger than the average degree, one thread may receive orders of magnitude more work than the average. For large templates, this imbalance is amplified by the exponential growth of the per-vertex counting work in line 9 of Algorithm 1.
To address the workload skewness, we propose a neighbor-list partitioning technique, implemented with the multithreading library OpenMP. Algorithm 4 illustrates how the fine-grained tasks assigned to threads are created. Given a maximal task size, the procedure examines the neighbor list length of each vertex. If the length exceeds the maximal task size, it extracts a sublist of that size from the neighbor list and creates a task pairing the sublist with the vertex; the same process is applied to the remainder of the truncated list until all neighbors are partitioned. If the length is already within the maximal task size, a single task with all the neighbors of the vertex is created.
Neighbor-list partitioning ensures that no extremely large task is assigned to a thread, since every task is bounded by the maximal task size, which improves workload balance at the thread level. However, it introduces a race condition when two threads update tasks associated with the same vertex. We use OpenMP atomic operations to resolve the race condition, and we shuffle the created task queue (line 17 of Algorithm 4) to reduce the chance of conflicts.
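The neighbor-list partitioning of Algorithm 4 can be sketched as follows (hypothetical names; the task-size bound and the final shuffle follow the description above):

```python
import random

def make_tasks(adj, max_task):
    """Chunk each vertex's neighbor list into pieces of at most max_task
    entries; each (vertex, chunk) pair becomes one thread-level task."""
    tasks = []
    for v, nbrs in enumerate(adj):
        for start in range(0, len(nbrs), max_task):
            tasks.append((v, nbrs[start:start + max_task]))
    random.shuffle(tasks)  # spread tasks of the same vertex apart
    return tasks
```

A high-degree vertex thus yields several bounded tasks instead of one oversized task, and shuffling makes it unlikely that two threads concurrently pick tasks for the same vertex, reducing contention on the atomic updates.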
4 Evaluation of Performance and Analysis of Results
4.1 Experimentation Setup
We conduct a set of experiments covering four implementations of the distributed color-coding algorithm with HarpDAAL: Naive, Pipeline, Adaptive, and AdaptiveLB (Load Balance). Table 1 lists the optimization techniques enabled in each. The experiments systematically investigate the impact of each optimization on the sparse irregularity, the low computation-to-communication ratio, and the high memory footprint of subgraph counting.
Implementation  Communication Mode  Adaptive Switch  Neighbor list partitioning 

Naive  all-to-all  Off  Off 
Pipeline  pipeline  Off  Off 
Adaptive  all-to-all/pipeline  On  Off 
AdaptiveLB  all-to-all/pipeline  On  On 
Data  Vertices  Edges  Avg Deg  Max Deg  Source  Abbreviation 

Miami  2.1M  51M  49  9868  social network  MI 
Orkut  3M  230M  76  33K  social network  OR 
NYC  18M  480M  54  429  social network  NY 
Twitter  44M  2B  50  3M  Twitter users  TW 
SK2005  50M  3.8B  73  8M  UbiCrawler  SK 
Friendster  66M  5B  57  5214  social network  FR 
RMAT250M(k=1,3,8)  5M  250M  100,102,217  170,40K,433K  PaRMAT  R250K1,3,8 
RMAT500M(k=3)  5M  500M  202  75K  PaRMAT  R500K3 
Template  Memory Complexity  Computation Complexity  Computation Intensity 

u31  3  6  2 
u52  25  70  2.8 
u72  147  434  2.9 
u102  1047  5610  5.3 
u121  4082  24552  6.0 
u122  3135  38016  12 
u13  4823  109603  22 
u14  7371  242515  32 
u151  12383  753375  60 
u152  15773  617820  39 
We use the synthetic and real datasets summarized in Table 2. Miami, Orkut [23][24][25], Twitter [26], SK2005 [27], and Friendster [24] are datasets generated by real applications. The RMAT synthetic datasets are generated with the RMAT model [28] by specifying the size and skewness; a higher skewness yields a more imbalanced out-degree distribution in the input graph. We therefore use RMAT datasets of different skewness to study the impact of unbalanced workloads on performance. The sizes and structures of the tree templates used in the experiments are shown in Figure 5; templates u31 to u122 are collected from [13], while u13 to u152 are the largest tree templates tested to date.
We observe that the size and shape of the sub-templates affect the ratio of computation to communication in our experiments. This corresponds to line 8 of Algorithm 1, where each sub-template is partitioned into two subtrees. The space complexity per neighbor when computing a sub-template is proportional to the communication data volume, while the computation bound additionally depends on the shape of the template. Table 3 lists the memory (space) complexity and the computation complexity for each template. In this chapter, we define the computation intensity of a template in Figure 5 as the ratio of its computation to its communication (or space) complexity. The computation intensity generally increases with the template size from u31 to u152; however, for the same template size, u122 has a computation intensity of 12 while u121 has only 6. We will refer to these values when analyzing the experimental results in the remaining sections.
All experiments run on an Intel Xeon E5 cluster with 25 nodes. Each node is equipped with two sockets of Xeon E5-2670 v3 (2 × 12 cores) and 120 GB of DDR4 memory. We use all 48 threads by default in our tests, and InfiniBand is enabled in both Harp and the MPI communication library. Our HarpDAAL code is compiled with JDK 8.0 and the Intel ICC 2016 compiler, as recommended by Intel. The MPI-Fascia [13] code is compiled with OpenMPI 1.8.1, as recommended by its developers.
4.2 Scaling by AdaptiveGroup Communication
We first conduct a baseline test with the Naive implementation of distributed color-coding. When the subgraph template size is scaled up, as shown in Figure 6, we make the following observations: 1) for the small template u52, computation decreases by 2x when scaling from 4 to 8 nodes, while communication only increases by 13%; 2) for the large template u122, doubling the number of cluster nodes reduces the computation time by only 1.5x, while communication grows by 5x. This implies that the all-to-all communication of the Naive implementation does not scale well for large templates.
To clarify the effectiveness of HarpDAAL Pipeline on large templates, Figure 7 compares the strong scaling speedup, the total execution time, and the ratio of communication to computation time between the Naive and Pipeline implementations on dataset R500K3, whose skewness is similar to real application datasets such as Orkut. For template u102, HarpDAAL Pipeline only slightly outperforms HarpDAAL Naive in speedup and total execution time. However, for u122, the performance gap grows to 2.3x (8 nodes) and 2.7x (10 nodes) in execution time, and the speedup improves significantly starting from 8 nodes. This result is consistent with Table 3, where u122 has twice the computation intensity of u102, which provides the pipeline design with sufficient workload to interleave the communication overhead. The ratio charts of Figure 7 confirm this: HarpDAAL Pipeline spends more than 65% of its time in computation on 8 and 10 nodes, while the computation ratio of HarpDAAL Naive stays below 50%. Although template u121 has the same size as u122, it has only half the computation intensity, as shown in Table 3. According to Equation 13, the low computation intensity of u121 reduces the overlap ratio, and Figure 8 shows that HarpDAAL Pipeline achieves less than 10% overlap for u121, while u122 keeps around 30% when scaling up to 10 cluster nodes.
For small templates such as u31 and u52, which have low computation intensities, we examine the effectiveness of the adaptivity in HarpDAAL Adaptive, where the code switches to all-to-all mode. In Figure 9, we conduct strong scaling tests with the small templates u31 and u52. The results show that, compared to HarpDAAL Pipeline, HarpDAAL Adaptive achieves better speedup for both u31 and u52 on three large datasets: Twitter, SK2005, and Friendster. The poor performance of HarpDAAL Pipeline on these datasets is due to its low overlap ratio, shown in Figure 8, which drops to near zero after scaling beyond 15 nodes.
In addition to strong scaling, we present weak scaling tests for template u122 in Figure 10. We generate a group of RMAT datasets with skewness 3 and numbers of vertices and edges growing proportionally to the number of cluster nodes. By fixing the workload on each cluster node, the weak scaling of HarpDAAL Pipeline reflects the additional communication overhead incurred as more cluster nodes are used. For HarpDAAL Pipeline, the execution time grows by only 20% when the number of cluster nodes doubles from 4 to 8. The ratio chart in Figure 10 also shows that the communication ratio of the Naive implementation increases to more than 50% on 8 cluster nodes, while that of the Pipeline implementation stays under 40%.
4.3 Finegrained Load Balance
Although AdaptiveGroup communication and the pipeline design mitigate node-level load imbalance caused by the skewed neighbor-list lengths of the input graph, they cannot resolve fine-grained workload imbalance at the thread level inside a node. Applying our neighbor-list partitioning technique, we compare the performance of HarpDAAL AdaptiveLB with HarpDAAL Adaptive on datasets of different skewness. In Figure 11, we first compare datasets with increasing skewness, as listed in Table 2. On R250K1 and MI, which have small skewness, neighbor-list partitioning barely gains any advantage; its benefit starts to appear on dataset OR, with a 2x improvement in execution time. For a dataset with high skewness, such as R250K8 with the u122 template, the acceleration reaches up to 9x in execution time, as shown in Figure 11.
When scaling threads from 6 to 48 on dataset MI, which has small skewness, the execution time does not improve much. For R250K8, HarpDAAL AdaptiveLB maintains good performance compared to the Naive implementation. In particular, the thread-level performance of HarpDAAL Naive drops after using more threads than physical cores (24), which indicates interference from hyper-threading; HarpDAAL AdaptiveLB, however, keeps its performance unaffected by hyper-threading. To further examine the thread efficiency of HarpDAAL AdaptiveLB, we measure thread concurrency with VTune. The histograms show the distribution of execution time over the number of concurrently running threads. For dataset MI, the average numbers of concurrent threads of HarpDAAL Naive and AdaptiveLB are close (22 versus 28) because MI does not have severe skewness-induced load imbalance. For dataset R250K8, the average number of concurrent threads of HarpDAAL AdaptiveLB exceeds that of HarpDAAL Naive by around 2x (40 versus 18).
Finally, we study how task-size granularity affects neighbor-list partitioning. In Algorithm 4, each neighbor-list update task is bounded by a selected maximal size. If this size is too small, a substantial number of tasks are created, adding thread scheduling and synchronization overhead. If it is too large, the benefits of partitioning the neighbor list cannot be fully exploited. A favorable range of task granularity can be observed in the experiments on R250K3 and R250K8: a task size between 40 and 60 performs better than other values.
4.4 Peak Memory Utilization
AdaptiveGroup communication and the pipeline design also reduce the peak memory utilization on each node. According to Equation 12, peak memory utilization depends on two terms: one from the local vertices and one from the remote neighbors. When the total dataset size is fixed, each process's share of local vertices decreases as the number of processes increases, which reduces the first term. The second term, associated with the remote neighbors received at each step, also decreases, because a larger number of steps leads to a smaller data volume per step. In Figure 12, we observe this reduction of peak memory utilization as the number of cluster nodes grows from 4 to 10. Compared to HarpDAAL Naive, HarpDAAL Pipeline reduces peak memory utilization by 2x on 4 nodes, and the saving grows to around 5x for the large templates u102, u121, and u122.
4.5 Overall Performance
Figure 13 compares HarpDAAL AdaptiveLB and MPI-Fascia in total execution time with growing templates on the Twitter dataset. For the small templates u31, u52, and u72, HarpDAAL AdaptiveLB performs comparably or slightly better; small templates cannot fully exploit the efficiency of the pipeline due to their low computation intensity. For the large template u102, HarpDAAL AdaptiveLB achieves 2x better performance than MPI-Fascia, and the gain grows to 5x for u122. Beyond u122, HarpDAAL AdaptiveLB still scales to templates up to u152, whereas MPI-Fascia cannot run templates larger than u122 on Twitter because its peak memory utilization exceeds the 120 GB memory limit per node.
Figures 14 and 15 further compare the strong scaling of HarpDAAL AdaptiveLB and MPI-Fascia. Scaling from 8 to 16 nodes, HarpDAAL AdaptiveLB achieves better speedup than MPI-Fascia for templates from u31 to u122; MPI-Fascia cannot run Twitter on 8 nodes due to its high peak memory utilization. The ratio charts in Figure 14 give more detail: MPI-Fascia has a comparable communication ratio for the small templates u31 and u52, but its communication ratio increases to 80% at template u102, while HarpDAAL AdaptiveLB keeps the communication ratio around 50%. At template u122, HarpDAAL AdaptiveLB further reduces the communication overhead to around 40%, because the adaptive-group communication and pipeline design favor large templates with high computation intensity.
5 Related Work
Subgraphs containing a sufficiently large independent set can be counted faster through matrix-multiplication-based methods [5, 29]. There is substantial work on parallelizing the color-coding technique. ParSE [30] is the first distributed color-coding algorithm; it scales to graphs with millions of vertices for tree-like templates of size up to 10, with hour-level execution times. SAHAD [12] extends this approach to labeled templates of up to 12 vertices on a graph with 9 million vertices in under an hour, using a Hadoop-based implementation. FASCIA [31, 32, 13] is the state-of-the-art color-coding treelet counting tool; with highly optimized data structures and an MPI+OpenMP implementation, it counts tree templates in billion-edge networks within a few minutes. Recent work [15] also explores more complex templates of treewidth 2, scaling to large templates and graphs. The original color-coding technique has been extended in various ways, e.g., to a derandomized version [33] and to other kinds of subgraphs.
6 Conclusion
Subgraph counting is an NP-hard problem with many important applications on large networks. We propose a novel pipelined communication scheme for finding and counting large tree templates. The proposed approach simultaneously addresses sparse irregularity, a low computation-to-communication ratio, and a high memory footprint, which are the difficult issues in scaling complex graph algorithms. The methods target large-subgraph cases and remain effective as graph size, subgraph size, and parallelism increase. Our implementation leverages the HarpDAAL framework and improves scalability by adaptively switching communication modes based on the size of the subgraph template. Fine-grained load balancing is achieved at runtime with thread-level parallelism. We demonstrate that the proposed approach is particularly effective on irregular subgraph counting problems and on problems with large subgraph templates. For example, it scales to templates of 15 vertices on the Twitter dataset (half a billion vertices and 2 billion edges) while achieving a 5x speedup over the state-of-the-art MPI solution. For datasets with high skewness, the improvement in execution time reaches 9x. Peak memory utilization is reduced by a factor of 2 on large templates (12 to 15 vertices) compared to existing work. Another successful application uses templates of 12 vertices on a massive Friendster input graph with 0.66 billion vertices and 5 billion edges. All experiments ran on a 25-node cluster of Intel Xeon (Haswell, 24-core) processors. Our subgraph counting source code is available in the public GitHub repository of the Harp project [17].
In future work, we plan to apply this HarpDAAL subgraph counting approach to other data-intensive irregular graph applications, such as random subgraph problems, and to obtain scalable solutions to their computational, communication, and load-balancing challenges.
Acknowledgments
We gratefully acknowledge generous support from the Intel Parallel Computing Center (IPCC) grant, NSF OCI114932 (Career: Programming Environments and Runtime for Data Enabled Science), CIF-DIBBS 143054: Middleware and High Performance Analytics Libraries for Scalable Data Science, an NSF EAGER grant, an NSF BIGDATA grant, and a DTRA CNIMS grant. We appreciate the support from IU PHI, the FutureSystems team, and the ISE Modelling and Simulation Lab.
References
 [1] X. Chen and J. C. S. Lui, “Mining graphlet counts in online social networks,” in ICDM, 2016, pp. 71–80.
 [2] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, p. 824, 2002.
 [3] A. Khan, N. Li, X. Yan, Z. Guan, S. Chakraborty, and S. Tao, “Neighborhood based fast graph search in large networks,” in SIGMOD, New York, NY, USA, 2011, pp. 901–912.
 [4] M. Bressan, F. Chierichetti, R. Kumar, S. Leucci, and A. Panconesi, “Counting Graphlets: Space vs Time,” in WSDM, 2017, pp. 557–566.
 [5] V. Vassilevska and R. Williams, “Finding, minimizing, and counting weighted subgraphs,” in STOC, 2009, pp. 455–464.
 [6] J. Flum and M. Grohe, “The parameterized complexity of counting problems,” SIAM Journal on Computing, vol. 33, no. 4, pp. 892–922, 2004.
 [7] R. Curticapean and D. Marx, “Complexity of counting subgraphs: Only the boundedness of the vertexcover number counts,” in FOCS. IEEE, 2014, pp. 130–139.
 [8] N. Alon, R. Yuster, and U. Zwick, “Color-coding,” J. ACM, vol. 42, no. 4, pp. 844–856, Jul. 1995.
 [9] I. Koutis, “Faster algebraic algorithms for path and packing problems,” in Proc. ICALP, 2008.
 [10] R. Williams, “Finding paths of length k in O*(2^k) time,” Information Processing Letters, vol. 109, no. 6, pp. 315–318, 2009.
 [11] A. Bjorklund, P. Kaski, and L. Kowalik, “Fast witness extraction using a decision oracle,” in Proc. ESA, 2014.
 [12] Z. Zhao, G. Wang, A. R. Butt, M. Khan, V. A. Kumar, and M. V. Marathe, “Sahad: Subgraph analysis in massive networks using hadoop,” in IPDPS, 2012, pp. 390–401.
 [13] G. M. Slota and K. Madduri, “Parallel color-coding,” Parallel Computing, vol. 47, pp. 51–69, 2015.
 [14] S. Ekanayake, J. Cadena, U. Wickramasinghe, and A. Vullikanti, “Midas: Multilinear detection at scale,” in Proc. IPDPS, 2018.
 [15] V. T. Chakaravarthy, M. Kapralov, P. Murali, F. Petrini, X. Que, Y. Sabharwal, and B. Schieber, “Subgraph Counting: Color Coding Beyond Trees,” in IPDPS, May 2016, pp. 2–11.
 [16] L. Chen, B. Peng, B. Zhang, T. Liu, Y. Zou, L. Jiang, R. Henschel, C. Stewart, Z. Zhang, E. Mccallum, T. Zahniser, O. Jon, and J. Qiu, “Benchmarking HarpDAAL: High Performance Hadoop on KNL Clusters,” in IEEE Cloud, Honolulu, Hawaii, US, Jun. 2017.
 [17] Indiana University, “HarpDAAL official website,” https://dscspidal.github.io/harp, 2018, online; accessed 2018-01-21.
 [18] B. Zhang, Y. Ruan, and J. Qiu, “Harp: Collective communication on Hadoop,” in IC2E, 2015, pp. 228–233.
 [19] B. Zhang, B. Peng, and J. Qiu, “High performance LDA through collective model communication optimization,” Procedia Computer Science, vol. 80, pp. 86–97, 2016.
 [20] B. Peng, B. Zhang, L. Chen, M. Avram, R. Henschel, C. Stewart, S. Zhu, E. Mccallum, L. Smith, T. Zahniser, J. Omer, and J. Qiu, “HarpLDA+: Optimizing latent dirichlet allocation for parallel efficiency,” in 2017 IEEE International Conference on Big Data (Big Data), Dec. 2017, pp. 243–252.
 [21] Intel Corporation, “The Intel Data Analytics Acceleration Library (Intel DAAL),” https://github.com/intel/daal, 2018, online; accessed 2018-01-21.
 [22] R. W. Hockney, “The communication challenge for MPP: Intel Paragon and Meiko CS-2,” Parallel Computing, vol. 20, no. 3, pp. 389–398, 1994.
 [23] C. L. Barrett, R. J. Beckman, M. Khan, V. S. A. Kumar, M. V. Marathe, P. E. Stretz, T. Dutta, and B. Lewis, “Generation and analysis of large synthetic social contact networks,” in WSC, Dec. 2009, pp. 1003–1014.
 [24] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
 [25] J. Yang and J. Leskovec, “Defining and Evaluating Network Communities Based on Ground-Truth,” in ICDM, Dec. 2012, pp. 745–754.
 [26] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi, “Measuring User Influence in Twitter: The Million Follower Fallacy.” in ICWSM, vol. 14, Jun. 2010.
 [27] T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.
 [28] D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A recursive model for graph mining,” in SDM, vol. 6, Apr. 2004.
 [29] M. Kowaluk, A. Lingas, and E.M. Lundell, “Counting and detecting small subgraphs via equations and matrix multiplication,” in SODA, 2011, pp. 1468–1476.
 [30] Z. Zhao, M. Khan, V. A. Kumar, and M. V. Marathe, “Subgraph enumeration in large social contact networks using parallel color coding and streaming,” in ICPP, 2010, pp. 594–603.
 [31] G. M. Slota and K. Madduri, “Fast approximate subgraph counting and enumeration,” in ICPP, 2013, pp. 210–219.
 [32] ——, “Complex network analysis using parallel approximate motif counting,” in IPDPS, 2014, pp. 405–414.
 [33] N. Alon and S. Gutner, “Balanced families of perfect hash functions and their applications,” ACM Trans. Algorithms, vol. 6, no. 3, pp. 54:1–54:12, Jul. 2010.